Chapter 1


Introduction to Statistics 


Leland Wilkinson 


Statistics and state have the same root. Statistics are the numbers of the state. More 
generally, they are any numbers or symbols that formally summarize our observations 
of the world. As we all know, summaries can mislead or elucidate. Statistics also 
refers to the introductory course we all seem to hate in college. When taught well, 
however, it is this course that teaches us how to use numbers to elucidate rather than 
to mislead. 

Statisticians specialize in many areas—probability, exploratory data analysis, 
modeling, social policy, decision making, and others. While they may philosophically 
disagree, statisticians nevertheless recognize at least two fundamental tasks: 
description and inference. Description involves characterizing a batch of data in 
simple but informative ways. Inference involves generalizing from a sample of data 
to a larger population of possible data. Descriptive statistics help us to observe more 
acutely, and inferential statistics help us to formulate and test hypotheses. 

Any distinctions, such as this one between descriptive and inferential statistics, are 
potentially misleading. Let us look at some examples, however, to see some 
differences between these approaches. 


Descriptive Statistics 


Descriptive statistics may be single numerical summaries of a batch, such as an 
average. Or, they may be more complex tables and graphs. What distinguishes 
descriptive statistics is their reference to a given batch of data rather than to a more 
general population or class. While there are exceptions, we usually examine 
descriptive statistics to understand the structure of a batch. A closely related field is 
called exploratory data analysis. Both exploratory and descriptive methods may lead 
us to formulate laws or test hypotheses, but their focus is on the data at hand. 

Consider, for example, the following batch. These are numbers of arrests by sex in 
1985 for selected crimes in the United States. The source is the FBI Uniform Crime 
Reports. What can we say about differences between the patterns of arrests of men and 
women in the United States in 1985? 


CRIME MALES FEMALES 


murder 12904 1815 
rape 28865 303 
robbery 105401 8639 
assault 211228 32926 
burglary 326959 26753 
larceny 744423 334053 
auto 97835 10093 
arson 13129 2003 
battery 416735 75937 
forgery 46286 23181 
fraud 151773 111825 
embezzle 5624 3184 
vandal 181600 20192 
weapons 134210 10970 
vice 29584 67592 
sex 74602 6108 
drugs 562754 90038 
gambling 21995 3879 
family 35553 5086 
dui 1208416 157131 
drunk 726214 70573 
disorderly 435198 99252 
vagrancy 24592 3001 
runaway 53808 72473 


Know Your Batch 


First, we must be careful in characterizing the batch. These statistics do not cover the 
gamut of U.S. crimes. We left out curfew and loitering violations, for example. Not all 
reported crimes are included in these statistics. Some false arrests may be included. 
State laws vary on the definitions of some of these crimes. Agencies may modify arrest 
statistics for political purposes. Know where your batch came from before you use it. 


Sum, Mean, and Standard Deviation 


Were there more male than female arrests for these crimes in 1985? The following 
output shows us the answer. Males were arrested for 5,649,688 crimes (not 5,649,688 
males—some may have been arrested more than once). Females were arrested 
1,237,007 times. 


                     |       MALES      FEMALES
---------------------+-------------------------
N of Cases           |          24           24
Minimum              |    5624.000      303.000
Maximum              | 1208416.000   334053.000
Sum                  | 5649688.000  1237007.000
Arithmetic Mean      |  235403.667    51541.958
Standard Deviation   |  305947.056    74220.864


How about the average (mean) number of arrests for a crime? For males, this was 
235,403 and for females, 51,542. Does the mean make any sense to you as a summary 
statistic? Another statistic in the table, the standard deviation, measures how much 
these numbers vary around the average. The standard deviation is the square root of the 
average squared deviation of the observations from their mean. It, too, has problems in 
this instance. First of all, both the mean and standard deviation should represent what 
you could observe in your batch, on average: the mean number of fish in a pond, the 
mean number of children in a classroom, the mean number of red blood cells per cubic 
millimeter. Here, we would have to say, “the mean murder-rape-robbery-...-runaway 
type of crime.” Second, even if the mean made sense descriptively, we might question 
its use as a typical crime-arrest statistic. To see why, we need to examine the shape of 
these numbers. 
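
For readers who want to check the output above by hand, here is a minimal sketch in Python (not SYSTAT). It recomputes the sum, mean, and standard deviation from the arrest table. The `describe` helper is a hypothetical name, and the n - 1 divisor in the variance is an assumption about how the printed standard deviation was computed.

```python
import math

# (crime, male arrests, female arrests) for 1985, copied from the table above.
ARRESTS = [
    ("murder", 12904, 1815), ("rape", 28865, 303), ("robbery", 105401, 8639),
    ("assault", 211228, 32926), ("burglary", 326959, 26753), ("larceny", 744423, 334053),
    ("auto", 97835, 10093), ("arson", 13129, 2003), ("battery", 416735, 75937),
    ("forgery", 46286, 23181), ("fraud", 151773, 111825), ("embezzle", 5624, 3184),
    ("vandal", 181600, 20192), ("weapons", 134210, 10970), ("vice", 29584, 67592),
    ("sex", 74602, 6108), ("drugs", 562754, 90038), ("gambling", 21995, 3879),
    ("family", 35553, 5086), ("dui", 1208416, 157131), ("drunk", 726214, 70573),
    ("disorderly", 435198, 99252), ("vagrancy", 24592, 3001), ("runaway", 53808, 72473),
]


def describe(values):
    """Return n, sum, mean, and standard deviation of a batch of numbers."""
    n = len(values)
    total = sum(values)
    mean = total / n
    # Squared deviations from the mean, divided by n - 1 (an assumption).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {"n": n, "sum": total, "mean": mean, "sd": math.sqrt(variance)}


males = describe([m for _, m, _ in ARRESTS])
females = describe([f for _, _, f in ARRESTS])

for label, stats in (("MALES", males), ("FEMALES", females)):
    print(f"{label:8s} n={stats['n']} sum={stats['sum']} "
          f"mean={stats['mean']:.3f} sd={stats['sd']:.3f}")
```

The arithmetic is easy; the point of the section is that the resulting mean describes no recognizable kind of crime.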


Stem-and-Leaf Plots 


Let us look at a display that compresses these data a little less drastically. The stem- 
and-leaf plot is like a tally. We pick a most significant digit or digits and tally the next 
digit to the right. By using trailing digits instead of tally marks, we preserve extra digits 
in the data. Notice the shape of the tally. There are mostly smaller numbers of arrests 
and a few crimes (such as larceny and driving under the influence of alcohol) with 
larger numbers of arrests. Another way of saying this is that the data are positively 
skewed toward larger numbers for both males and females. 
Stem and Leaf Plot of Variable: MALES, N = 24 
Minimum : 5624.000 
Lower Hinge : 29224.500 
Median : 101618.000 
Upper Hinge : 371847.000 
When data are skewed like this, the mean gets pulled from the center of the majority 
of numbers toward the extreme with the few. A statistic that is not as sensitive to 
extreme values is the median. The median is the value above which half the data fall. 
More precisely, if you sort the data, the median is the middle value or the average of 
the two middle values. Notice that for males the median is 101,618, and for females, 
21,686. Both are considerably smaller than the means and more typical of the majority 
of the numbers. This is why the median is often used for representing skewed data, 
such as incomes, populations, or reaction times. 
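
As a rough illustration of how skew pulls the mean away from the median, the following Python sketch (standard library only, not SYSTAT) compares the two statistics for the arrest counts. The lists repeat the MALES and FEMALES columns of the arrest table.

```python
import statistics

# Arrest counts copied from the MALES and FEMALES columns of the table.
MALES = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]
FEMALES = [1815, 303, 8639, 32926, 26753, 334053, 10093, 2003,
           75937, 23181, 111825, 3184, 20192, 10970, 67592, 6108,
           90038, 3879, 5086, 157131, 70573, 99252, 3001, 72473]

for label, batch in (("MALES", MALES), ("FEMALES", FEMALES)):
    mean = statistics.mean(batch)
    # With an even number of cases, the median is the average of the
    # two middle values of the sorted batch.
    median = statistics.median(batch)
    print(f"{label:8s} mean={mean:12.1f} median={median:10.1f} "
          f"mean/median={mean / median:.2f}")
```

A mean more than twice the median, as here, is a quick sign of strong positive skew.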

We still have the same representativeness problem that we had with the mean, 
however. Even if the medians corresponded to real data values in this batch (which 
they don’t because there is an even number of observations), it would be hard to 
characterize what they would represent. 


Sorting 


Most people think of means, standard deviations, and medians as the primary 
descriptive statistics. They are useful summary quantities when the observations 
represent values of a single variable. We purposely chose an example where they are 
less appropriate, however, even when they are easily computable. There are better 
ways to reveal the patterns in these data. Let us look at sorting as a way of uncovering 
structure. 

I was talking once with an FBI agent who had helped to uncover the Chicago 
machine’s voting fraud scandal some years ago. He was a statistician, so I was curious 
what statistical methods he used to prove the fraud. He replied, “We sorted the voter 
registration tape alphabetically by last name. Then we looked for duplicate names and 
addresses.” Sorting is one of the most basic and powerful data analysis techniques. The 
stem-and-leaf plot, for example, is a sorted display. 

We can sort on any numerical or character variable. It depends on our goal. We 
began this chapter with a question: Are there differences between the patterns of arrests 
of men and women in the United States in 1985? How about sorting the male and 
female arrests separately? If we do this, we will get a list of crimes in order of 
decreasing frequency within sex. 


MALES         FEMALES

dui           larceny
larceny       dui
drunk         fraud
drugs         disorderly
disorderly    drugs
battery       battery
burglary      runaway
assault       drunk
vandal        vice
fraud         assault
weapons       burglary
robbery       forgery
auto          vandal
sex           weapons
runaway       auto
forgery       robbery
family        sex
vice          family
rape          gambling
vagrancy      embezzle
gambling      vagrancy
arson         arson
murder        murder
embezzle      rape


You might want to connect similar crimes with lines. The number of crossings would 
indicate differences in ranks. 
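
The two sorted lists can be reproduced with a few lines of Python (a sketch, not SYSTAT). As an illustrative extra, it also counts the crossings we would get if we connected each crime across the two lists with a line. Counting a crossing for every pair of crimes ranked in opposite order by the two sexes is our own reading of that suggestion.

```python
# (crime, male arrests, female arrests) for 1985, copied from the arrest table.
ARRESTS = [
    ("murder", 12904, 1815), ("rape", 28865, 303), ("robbery", 105401, 8639),
    ("assault", 211228, 32926), ("burglary", 326959, 26753), ("larceny", 744423, 334053),
    ("auto", 97835, 10093), ("arson", 13129, 2003), ("battery", 416735, 75937),
    ("forgery", 46286, 23181), ("fraud", 151773, 111825), ("embezzle", 5624, 3184),
    ("vandal", 181600, 20192), ("weapons", 134210, 10970), ("vice", 29584, 67592),
    ("sex", 74602, 6108), ("drugs", 562754, 90038), ("gambling", 21995, 3879),
    ("family", 35553, 5086), ("dui", 1208416, 157131), ("drunk", 726214, 70573),
    ("disorderly", 435198, 99252), ("vagrancy", 24592, 3001), ("runaway", 53808, 72473),
]

# Sort separately within sex, most frequent crime first.
males_ranked = [c for c, _, _ in sorted(ARRESTS, key=lambda r: r[1], reverse=True)]
females_ranked = [c for c, _, _ in sorted(ARRESTS, key=lambda r: r[2], reverse=True)]

print(f"{'MALES':12s}FEMALES")
for left, right in zip(males_ranked, females_ranked):
    print(f"{left:12s}{right}")

# Position of each crime in the female list, read off in male order.
female_rank = {crime: i for i, crime in enumerate(females_ranked)}
order = [female_rank[crime] for crime in males_ranked]

# Two connecting lines cross when a pair of crimes is ordered one way for
# males and the other way for females.
crossings = sum(
    1
    for i in range(len(order))
    for j in range(i + 1, len(order))
    if order[i] > order[j]
)
print("crossings:", crossings)
```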


Standardizing 


This ranking is influenced by prevalence. The most frequent crimes occur at the top of 
the list in both groups. Comparisons within crimes are obscured by this influence. Men 
committed almost 100 times as many rapes as women, for example, yet rape is near the 
bottom of both lists. If we are interested in contrasting the sexes on patterns of crime 
while holding prevalence constant, we must standardize the data. There are several 
ways to do this. You may have heard of standardized test scores for aptitude tests. 
These are usually produced by subtracting means and then dividing by standard 
deviations. Another method is simply to divide by row or column totals. For the crime 
data, we will divide by totals within rows (each crime). Doing so gives us the 
proportion of arrests for each crime that involved men or women. The total of these two 
proportions will thus be 1. 

Now, a contrast between men and women on this standardized value should reveal 
variations in arrest patterns within crime type. By subtracting the female proportion 
from the male, we will highlight primarily male crimes with positive values and female 
crimes with negative. Next, sort these differences and plot them in a simple graph. The 
following shows the result: 


[Figure: crimes sorted by proportion difference; horizontal axis "Female <---- Proportion Difference ----> Male", running from -0.5 to 1.0]


Now we can see clear contrasts between males and females in arrest patterns. The 
predominantly aggressive crimes appear at the top of the list. Rape now appears where 
it belongs—an aggressive, rather than sexual, crime. A few crimes dominated by 
females are at the bottom. 
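
The row standardization described above can be sketched in Python as follows (an illustration, not the SYSTAT commands used to draw the graph). Each count is divided by its row total, the female proportion is subtracted from the male proportion, and the crimes are sorted by the difference.

```python
# (crime, male arrests, female arrests) for 1985, copied from the arrest table.
ARRESTS = [
    ("murder", 12904, 1815), ("rape", 28865, 303), ("robbery", 105401, 8639),
    ("assault", 211228, 32926), ("burglary", 326959, 26753), ("larceny", 744423, 334053),
    ("auto", 97835, 10093), ("arson", 13129, 2003), ("battery", 416735, 75937),
    ("forgery", 46286, 23181), ("fraud", 151773, 111825), ("embezzle", 5624, 3184),
    ("vandal", 181600, 20192), ("weapons", 134210, 10970), ("vice", 29584, 67592),
    ("sex", 74602, 6108), ("drugs", 562754, 90038), ("gambling", 21995, 3879),
    ("family", 35553, 5086), ("dui", 1208416, 157131), ("drunk", 726214, 70573),
    ("disorderly", 435198, 99252), ("vagrancy", 24592, 3001), ("runaway", 53808, 72473),
]

differences = []
for crime, male, female in ARRESTS:
    total = male + female
    p_male = male / total      # proportion of this crime's arrests that were male
    p_female = female / total  # the two proportions sum to 1 within each row
    differences.append((crime, p_male - p_female))

# Most male-dominated crimes first, most female-dominated last.
differences.sort(key=lambda pair: pair[1], reverse=True)

for crime, diff in differences:
    print(f"{crime:12s}{diff:+.3f}")

female_dominated = [crime for crime, diff in differences if diff < 0]
```

Dividing by the row total removes prevalence: a rare crime such as rape can now sit at the top of the list.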


Inferential Statistics 


We often want to do more than describe a particular sample. In order to generalize, 
formulate a policy, or test a hypothesis, we need to make an inference. Making an 
inference implies that we think a model describes a more general population from 
which our data have been randomly sampled. Sometimes it is difficult to imagine a 
population from which you have gathered data. A population can be “all possible 
voters,” “all possible replications of this experiment,” or “all possible moviegoers.” 
When you make inferences, you should have a population in mind. 


What is a Population? 


We are going to use inferential methods to estimate the mean age of the unusual 
population contained in the 1980 edition of Who's Who in America. We could enter all 
73,500 ages into a SYSTAT file and compute the mean age exactly. If it were practical, 
this would be the preferred method. Sometimes, however, a sampling estimate can be 
more accurate than an entire census. For example, biases are introduced into large 
censuses from refusals to comply, keypunch or coding errors, and other sources. In 
these cases, a carefully constructed random sample can yield less-biased information 
about the population. 

This is an unusual population because it is contained in a book and is therefore 
finite. We are not about to estimate the mean age of the rich and famous. After all, Spy 
magazine used to have a regular feature listing all of the famous people who are not in 
Who's Who. And bogus listings may escape the careful fact checking of the Who's Who 
research staff. When we get our estimate, we might be tempted to generalize beyond 
the book, but we would be wrong to do so. For example, if a psychologist measures 
opinions in a random sample from a class of college sophomores, his or her 
conclusions should begin with the statement, “College sophomores at my university 
think...” If the word “people” is substituted for “college sophomores,” it is the 
experimenter’s responsibility to make clear that the sample is representative of the 
larger group on all attributes that might affect the results. 


Picking a Simple Random Sample 


That our population is finite should cause us no problems as long as our sample is much 
smaller than the population. Otherwise, we would have to use special techniques to 
adjust for the bias it would cause. How do we choose a simple random sample from 
a population? We use a method that ensures that every possible sample of a given size 
has an equal chance of being chosen. The following methods are not random: 


■ Pick the first name on every tenth page (some names have no chance of being 
chosen). 

■ Close your eyes, flip the pages of the book, and point to a name (Tversky and others 
have done research that shows that humans cannot behave randomly). 

■ Randomly pick the first letter of the last name and randomly choose from the 
names beginning with that letter (there are more names beginning with C, for 
example, than with I). 


The way to pick randomly from a book, file, or any finite population is to assign a 
number to each name or case and then pick a sample of numbers randomly. You can 
use SYSTAT to generate a random number between 1 and 73,500, for example, with 
the expression: 


1 + INT(73500*URN( )) 
There are too many pages in Who's Who to use this method, however. As a short cut, I 
randomly generated a page number and picked a name from the page using the random 
number generator. This method should work well provided that each page has 
approximately the same number of names (between 19 and 21 in this case). The sample 
is shown below: 


AGE  SEX       AGE  SEX

60   male      38   female
74   male      44   male
39   female    49   male
78   male      62   male
66   male      76   female
63   male      51   male
45   male      51   male
56   male      75   male
65   male      65   female
51   male      41   male
52   male      67   male
59   male      50   male
67   male      55   male
48   male      45   male
36   female    49   male
34   female    58   male
68   male      47   male
50   male      55   male
51   male      67   male
47   male      58   male
81   male      76   male
56   male      70   male
49   male      69   male
58   male      46   male
58   male      60   male
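
The numbering approach described earlier in this section (assign a number to each case, then pick numbers at random) can be sketched in Python as follows. This illustrates the idea only; it is not the page-based shortcut used to draw the sample above. The seed value and the name `random_case_number` are arbitrary choices made for the sketch.

```python
import random

POPULATION_SIZE = 73500  # number of entries, as stated in the text
SAMPLE_SIZE = 50

rng = random.Random(1980)  # arbitrary seed so the sketch is repeatable


def random_case_number(rng, n=POPULATION_SIZE):
    """Python analogue of 1 + INT(73500*URN( )): a uniform draw from 1..n."""
    # rng.random() is uniform on [0, 1), so the result lies in 1..n.
    return 1 + int(n * rng.random())


one_case = random_case_number(rng)

# Repeated single draws can pick the same case twice. Sampling without
# replacement gives every possible set of 50 distinct cases the same chance.
sample_ids = rng.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE)

print(one_case)
print(sorted(sample_ids))
```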
Specifying a Model 


To make an inference about age, we need to construct a model for our population: 
$a = \mu + \varepsilon$


This model says that the age ($a$) of someone we pick from the book can be described 
by an overall mean age ($\mu$) plus an amount of error ($\varepsilon$) specific to that person and due 
to random factors that are too numerous and insignificant to describe systematically. 
Notice that we use Greek letters to denote things that we cannot observe directly and 
Roman letters for those that we do observe. Of the unobservables in the model, $\mu$ is 
called a parameter, and $\varepsilon$ a random variable. A parameter is a constant that helps to 
describe a population. Parameters indicate how a model is an instance of a family of 
models for similar populations. A random variable varies like the tossing of a coin. 
There are two more parameters associated with the random variable $\varepsilon$ but not 
appearing in the model equation. One is its mean ($\mu_\varepsilon$), which we have rigged to be 0, 
and the other is its standard deviation ($\sigma_\varepsilon$, or simply $\sigma$). Because $a$ is simply the sum 
of $\mu$ (a constant) and $\varepsilon$ (a random variable), its standard deviation is also $\sigma$. 


In specifying this model, we assume the following: 
■ The model is true for every member of the population. 

■ The error, plus or minus, that helps determine one population member's age is 
independent of (not predictable from) the error for other members. 

■ The errors in predicting all of the ages come from the same random distribution 
with a mean of 0 and a standard deviation of $\sigma$. 


Estimating a Model 


Because we have not sampled the entire population, we cannot compute the parameter 
values directly from the data. We have only a small sample from a much larger 
population, so we can estimate the parameter values only by using some statistical 
method on our sample data. When our three assumptions are appropriate, the sample 
mean will be a good estimate of the population mean. Without going into all of the 
details, the sample estimate will be, on average, close to the values of the mean in the 
population. 
We can use various methods in SYSTAT to estimate the mean. One way is to 
specify our model using Linear Regression. Select AGE and add it to the Dependent 
list. With commands: 


REGRESSION 
MODEL AGE=CONSTANT 


This model says that AGE is a function of a constant value ($\mu$). The rest is error ($\varepsilon$). 
Another method is to compute the mean from the Basic Statistics routines. The result 
is shown below: 


N of Cases                          |     50
Arithmetic Mean                     | 56.700
Standard Error of Arithmetic Mean   |  1.643
Standard Deviation                  | 11.620


Our best estimate of the mean age of people in Who's Who is 56.7 years. 
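
The same estimates can be recomputed outside SYSTAT. The following Python sketch enters the 50 sampled ages from the table in "Picking a Simple Random Sample" and computes the mean, the standard deviation (with the n - 1 divisor), and the standard error of the mean.

```python
import math
import statistics

# The 50 sampled ages: left AGE column first, then the right AGE column.
AGES = [
    60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
    50, 51, 47, 81, 56, 49, 58, 58,
    38, 44, 49, 62, 76, 51, 51, 75, 65, 41, 67, 50, 55, 45, 49, 58, 47,
    55, 67, 58, 76, 70, 69, 46, 60,
]

n = len(AGES)
mean = statistics.mean(AGES)  # estimate of the population mean
sd = statistics.stdev(AGES)   # sample standard deviation, n - 1 divisor
se = sd / math.sqrt(n)        # standard error of the mean

print(f"N of Cases         {n}")
print(f"Arithmetic Mean    {mean:.3f}")
print(f"Standard Error     {se:.3f}")
print(f"Standard Deviation {sd:.3f}")
```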


Confidence Intervals 


Our estimate seems reasonable, but it is not exactly correct. If we took more samples 
of size 50 and computed estimates, how much would we expect them to vary? First, it 
should be plain without any mathematics to see that the larger our sample, the closer 
will be our sample estimate to the true value of $\mu$ in the population. After all, if we 
could sample the entire population, the estimates would be the true values. Even so, the 
variation in sample estimates is a function only of the sample size and the variation of 
the ages in the population. It does not depend on the size of the population (number of 
people in the book). Specifically, the standard deviation of the sample mean is the 
standard deviation of the population divided by the square root of the sample size. This 
standard error of the mean is listed on the output above as 1.643. On average, we 
would expect our sample estimates of the mean age to vary by plus or minus a little 
more than one and a half years, assuming samples of size 50. 

If we knew the shape of the sampling distribution of mean age, we would be able to 
complete our description of the accuracy of our estimate. There is an approximation 
that works quite well, however. If the sample size is reasonably large (say, greater than 
25), then the mean of a simple random sample is approximately normally distributed. 
This is true even if the population distribution is not normal, provided the sample size 
is large. 

We now have enough information from our sample to construct a normal 
approximation of the distribution of our sample mean. The following figure shows this 
approximation to be centered at the sample estimate of 56.7 years. Its standard 
deviation is taken from the standard error of the mean, 1.643 years. 


[Figure: normal approximation to the distribution of the sample mean; horizontal axis: Mean Age]


We have drawn the graph so that the central area comprises 95% of all the area under 
the curve (from about 53.5 to 59.9). From this normal approximation, we have built a 
95% symmetric confidence interval that gives us a specific idea of the variability of 
our estimate. If we did this entire procedure again—sample 50 names, compute the 
mean and its standard error, and construct a 95% confidence interval using the normal 
approximation—then we would expect that 95 intervals out of a hundred so 
constructed would cover the real population mean age. Remember, population mean 
age is not necessarily at the center of the interval that we just constructed, but we do 
expect the interval to be close to it. 
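
The interval quoted above can be recomputed from the summary statistics alone. This Python sketch uses the normal approximation, taking the mean, standard deviation, and sample size from the printed output. `NormalDist` comes from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist

# Summary statistics taken from the output in "Estimating a Model".
n = 50
mean = 56.7
sd = 11.620

se = sd / math.sqrt(n)           # standard error of the mean
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% two-sided interval

lower = mean - z * se
upper = mean + z * se
print(f"95% interval (normal approximation): {lower:.1f} to {upper:.1f}")
```

The t-based interval that SYSTAT prints with its t test is slightly wider, because it also allows for the uncertainty in the estimated standard deviation.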


Hypothesis Testing 


From the sample mean and its standard error, we can also construct hypothesis tests on 
the mean. Suppose that someone believed that the average age of those listed in Who's 
Who is 61 years. After all, we might have picked an unusual sample just through the 
luck of the draw. Let us say, for argument, that the population mean age is 61 and the 
standard deviation is 11.62. How likely would it be to find a sample mean age of 56.7? 
If it is very unlikely, then we would reject this null hypothesis that the population mean 
is 61. Otherwise, we would fail to reject it. 

There are several ways to represent an alternative hypothesis against this null 
hypothesis. We could make a simple alternative value of 56.7 years. Usually, however, 
we make the alternative composite—that is, it represents a range of possibilities that 
do not include the value 61. Here is how it would look: 
$H_0$: $\mu = 61$ (null hypothesis) 
$H_A$: $\mu \neq 61$ (alternative hypothesis) 


We would reject the null hypothesis if our sample value for the mean were outside of 
a set of values that a population value of 61 could plausibly generate. In this context, 
“plausible” means more probable than a conventionally agreed upon critical level for 
our test. This value is usually 0.05. A result that would be expected to occur fewer than 
five times in a hundred samples is considered significant and would be a basis for 
rejecting our null hypothesis. 

Constructing this hypothesis test is mathematically equivalent to sliding the normal 
distribution in the above figure to center over 61. We then look at the sample value 56.7 
to see if it is outside of the middle 95% of the area under the curve. If so, we reject the 
null hypothesis. 


The following t test output shows a p value (probability) of 0.012 for this test. Because 
this value is lower than 0.05, we would reject the null hypothesis that the mean age is 
61. This is equivalent to saying that the value of 61 does not appear in the 95% 
confidence interval. 


One-sample t-test of AGE with 50 Cases
Ho: Mean = 61.000 vs Alternative = 'not equal'

Mean                       : 56.700
95.00% Confidence Interval : 53.398 to 60.002
Standard Deviation         : 11.620
t                          : -2.617
df                         : 49
p-value                    : 0.012
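
To see where the t statistic and p-value in this output come from, here is a sketch in Python using only the standard library. Because the standard library has no Student's t distribution, the sketch integrates the t density numerically with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are our own, and the numerical integration is an assumption of the sketch, not a description of SYSTAT's algorithm.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: the area outside [-|t|, |t|] under the t density."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # area between -|t| and |t|, by symmetry
    return 1 - central


# Summary statistics from the output, and the hypothesized mean.
n, mean, sd, mu0 = 50, 56.7, 11.620, 61.0

se = sd / math.sqrt(n)
t = (mean - mu0) / se  # distance from the null value in standard errors
p = two_sided_p(t, n - 1)

print(f"t = {t:.3f}, df = {n - 1}, p-value = {p:.3f}")
```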
The mathematical duality between confidence intervals and hypothesis testing may 
lead you to wonder which is more useful. The answer is that it depends on the context. 
Scientific journals usually follow a hypothesis testing model because their null 
hypothesis value for an experiment is usually 0 and the scientist is attempting to reject 
the hypothesis that nothing happened in the experiment. Any rejection is usually taken 
to be interesting, even when the sample size is so large that even tiny differences from 
0 will be detected. 

Those involved in making decisions—epidemiologists, business people, 
engineers—are often more interested in confidence intervals. They focus on the size 
and credibility of an effect and care less whether it can be distinguished from 0. Some 
statisticians, called Bayesians, go a step further and consider statistical decisions as a 
form of betting. They use sample information to modify prior hypotheses. See Berger 
(1993) and Box and Tiao (1992) for further information on Bayesian statistics. 


Checking Assumptions 


Now that we have finished our analyses, we should check some of the assumptions we 
made in doing them. First, we should examine whether the data look normally 
distributed. Although sample means will tend to be normally distributed even when the 
population isn’t, it helps to have a normally distributed population, especially when we 
do not know the population standard deviation. The stem-and-leaf plot gives us a quick 
idea: 


Stem and Leaf Plot of Variable: AGE, N = 50 
There is another plot, called a dot histogram (dit) plot, which looks like a stem-and-leaf 
plot. We can use different symbols to denote males and females in this plot, however, 
to see if there are differences in these subgroups. Although there are not enough 
females in the sample to be sure of a difference, it is nevertheless a good idea to 
examine it. The dot histogram reveals four of the six females to be younger than 
everyone else. 


[Figure: dot histogram of AGE, with different symbols for males and females (SEX$)]
A better test of normality is to plot the sorted age values against the corresponding 
values of a mathematical normal distribution. This is called a normal probability plot. 
If the data are normally distributed, then the plotted values should fall approximately 
on a straight line. Our data plot fairly straight. Again, different symbols are used for 
the males and females. The four young females appear in the bottom left corner of the 
plot. 


[Figure: normal probability plot; horizontal axis: AGE (30 to 90), vertical axis: Expected Value for Normal Distribution; symbols distinguish Sex]


Does this possible difference in ages by gender invalidate our results? No, but it 
suggests that we might want to examine the gender differences further to see whether 
or not they are significant. 
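
The coordinates of a normal probability plot are simple to compute. The Python sketch below pairs each sorted age with an expected normal value and summarizes straightness with a correlation. The plotting positions (i - 0.5)/n are one common convention and are an assumption here; SYSTAT may use a different one.

```python
import math
from statistics import NormalDist

# The 50 sampled ages from "Picking a Simple Random Sample".
AGES = [
    60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
    50, 51, 47, 81, 56, 49, 58, 58,
    38, 44, 49, 62, 76, 51, 51, 75, 65, 41, 67, 50, 55, 45, 49, 58, 47,
    55, 67, 58, 76, 70, 69, 46, 60,
]

n = len(AGES)
sorted_ages = sorted(AGES)

# Expected standard normal value for the i-th smallest observation.
expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# A correlation near 1 means the points lie close to a straight line.
r = pearson(sorted_ages, expected)

for age, z in zip(sorted_ages, expected):
    print(f"{age:3d}  {z:+.3f}")
print(f"correlation = {r:.3f}")
```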
Chapter 2 
Bootstrapping and Sampling 


Leland Wilkinson and Laszlo Engelman 
(revised by Mousum Dutta, Santosh Ranjan, and Anusha Ramakrishnan) 


Resampling (which includes bootstrapping) is not a module in SYSTAT. It is a 
procedure available in most modules where appropriate. Resampling is so important 
as a general statistical methodology, however, that it deserves a separate chapter. In 
SYSTAT, this feature is available as a tab in the dialog box of modules where 
applicable and it offers three resampling techniques: Bootstrap, Without replacement 
sampling, and Jackknife. The computations are handled without producing a scratch 
file of the generated samples. This saves disk space and computer time. Bootstrap, 
jackknife, and other samples are simply computed "on-the-fly". 

SYSTAT provides a summarization based on resampling in Descriptive Statistics, 
Simple Correlations, and Least-Squares Regression. For further details, see the 
respective chapters. 


Statistical Background 


Resampling methods such as the bootstrap and the jackknife are widely used to obtain 
point and interval estimates of parameters from samples taken from 
unknown probability distributions. The bootstrap (Efron and Tibshirani, 1993) is a 
powerful resampling technique. Efron and LePage (1992) summarize the problem 
most succinctly. We have a set of real-valued observations $x_1, x_2, \ldots, x_n$ independently 
sampled from an unknown probability distribution $F$. We are interested in estimating 
some parameter $\theta$ by using the information in the sample data with an estimator 
$\hat{\theta} = t(x)$. Some measure of the estimate's accuracy is as important as the estimate 
itself; we want a standard error of $\hat{\theta}$ and, even better, a confidence interval on the 
true value $\theta$. 
Classical statistical methods provide a powerful way of handling this problem when $F$ 
is known and $\theta$ is simple, for example when $\theta$ is the mean of the normal 
distribution. Focusing on the standard error of the mean, we have: 

$$se(\bar{x}; F) = \left[ \frac{\sigma^2(F)}{n} \right]^{1/2}$$

Substituting the unbiased estimate for $\sigma^2(F)$, 

$$\hat{\sigma}^2(F) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

we have: 

$$\widehat{se}(\bar{x}) = \left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n - 1)} \right]^{1/2}$$


Parametric methods often work fairly well even when the distribution is contaminated 
or only approximately known, because the central limit theorem shows that the sum of 
independent random variables with finite variances tends to be normal in large samples 
even when the variables themselves are not normal. But problems arise for estimates 
more complicated than a mean—medians, sample correlation coefficients, or 
eigenvalues, especially in small or medium-sized samples and even, in some cases, in 
large samples. 

Strategies for approaching this problem "nonparametrically" have involved using 
the empirical distribution $\hat{F}$ to obtain the information needed for the standard error 
estimate. One approach is Tukey's jackknife (Tukey, 1958), which is offered in 
SAMPLE=JACKKNIFE. Tukey proposed computing $n$ subsets of $(x_1, x_2, \ldots, x_n)$, each 
consisting of all of the cases except the $i$th deleted case (for $i = 1, \ldots, n$). He produced 
standard errors as a function of the estimates from these subsets. 
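
A minimal Python sketch of the leave-one-out idea follows. The standard error formula used is the usual textbook jackknife formula, which is not spelled out in this chapter, and `jackknife_se` is a hypothetical helper name. The ten data values are the first ten ages from the Who's Who sample in Chapter 1, used purely for illustration.

```python
import math
import statistics


def jackknife_se(data, estimator):
    """Jackknife standard error of estimator(data)."""
    n = len(data)
    # One estimate per subset, each subset omitting the i-th case.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    # Textbook jackknife formula: sqrt((n - 1)/n * sum of squared deviations).
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51]  # illustrative data

se_jack_mean = jackknife_se(ages, statistics.mean)
se_classical = statistics.stdev(ages) / math.sqrt(len(ages))
se_jack_median = jackknife_se(ages, statistics.median)

print(f"jackknife se of mean  : {se_jack_mean:.4f}")
print(f"classical se of mean  : {se_classical:.4f}")
print(f"jackknife se of median: {se_jack_median:.4f}")
```

For the mean, the jackknife reproduces the classical standard error exactly. Its value lies in estimators that have no such closed form, although it is known to behave poorly for non-smooth statistics such as the median.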

Another approach has involved subsampling, usually via simple random samples. 
This option is offered in SAMPLE=SIMPLE. A variety of researchers in the 1950s and 
1960s explored these methods empirically (for example, Block, 1960; see Noreen, 
1989, for others). This method amounts to a Monte Carlo study in which the sample is 
treated as the population. It is also closely related to the methodology for permutation 
tests (Fisher, 1935; Dwass, 1957; Edgington, 1995). 

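The bootstrap (Efron, 1979) has been the focus of most recent theoretical research. 
$\hat{F}$ is defined as: 

$\hat{F}$: probability $1/n$ on $x_i$ for $i = 1, 2, \ldots, n$ 

Then, since 

$$\sigma^2(\hat{F}) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

the standard error $se(\bar{x}; \hat{F}) = [\sigma^2(\hat{F})/n]^{1/2}$ agrees with the classical estimate of the 
standard error of the mean up to a factor of $[(n - 1)/n]^{1/2}$. 
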

The computer algorithm for getting the samples for generating $\hat{F}$ is to sample from 
$(x_1, \ldots, x_n)$ with replacement. Efron and other researchers have shown that the general 
procedure of generating samples and computing estimates $\hat{\theta}$ yields "$\hat{\theta}$ data" on 
which we can make useful inferences. For example, instead of computing only $\hat{\theta}$ 
and its standard error, we can do histograms, densities, order statistics (for symmetric 
and asymmetric confidence intervals), and other computations on our estimates. In 
other words, there is much to learn from the bootstrap sample distributions of the 
estimates themselves. 
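
The sampling-with-replacement algorithm is short enough to sketch in Python. This illustrates the naive bootstrap only and is not SYSTAT's implementation. The function name `bootstrap_estimates`, the seed, and the choice of 2000 samples are arbitrary, and the data are the first ten ages from the Who's Who sample in Chapter 1.

```python
import math
import random
import statistics


def bootstrap_estimates(data, estimator, n_samples, rng):
    """Apply estimator to n_samples resamples drawn with replacement."""
    n = len(data)
    # rng.choices draws n cases with replacement, each with probability 1/n.
    return [estimator(rng.choices(data, k=n)) for _ in range(n_samples)]


ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51]  # illustrative data
rng = random.Random(12345)                       # arbitrary seed

boot_means = bootstrap_estimates(ages, statistics.mean, 2000, rng)
boot_medians = bootstrap_estimates(ages, statistics.median, 2000, rng)

# The standard deviation of the bootstrap estimates is the bootstrap
# standard error of the estimator.
se_boot_mean = statistics.stdev(boot_means)
se_boot_median = statistics.stdev(boot_medians)

# For the mean, the answer is known in closed form: sigma(F-hat) / sqrt(n).
se_plug_in = statistics.pstdev(ages) / math.sqrt(len(ages))

print(f"bootstrap se of mean  : {se_boot_mean:.3f}")
print(f"plug-in se of mean    : {se_plug_in:.3f}")
print(f"bootstrap se of median: {se_boot_median:.3f}")
```

The lists `boot_means` and `boot_medians` are the bootstrap distributions themselves, so they can also feed histograms or order statistics.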

There are some concerns, however. The naive bootstrap computed this way (with 
SAMPLE=BOOT and CSTATISTICS for computing means and standard deviations) is 
not especially good for long-tailed distributions. It is also not suited for time-series or 
stochastic data. See LePage and Billard (1992) for research on solutions to some of 
these problems. There are also several simple improvements to the naive bootstrap. One 
is the pivot, or bootstrap-t method, discussed in Efron and Tibshirani (1993). This is 
especially useful for confidence intervals on the mean of an unknown distribution. 
Efron (1982) discusses other applications. There are also refinements based on 
correction for bias in the bootstrap sample itself (DiCiccio and Efron, 1996). 

In general, however, the naive bootstrap can help you get better estimates of 
standard errors and confidence intervals than many large-sample approximations, such 
as Fisher's z transformation for Pearson correlations or Wald tests for coefficients in 
nonlinear models. And in cases in which no good approximations are available (see 
some of the examples below), the bootstrap is the only way to go. For more 
information, see Chernick (1999), Davison and Hinkley (1997), Good (2004), and 
Lunneborg (2000). 


Bootstrap estimates can be used for obtaining confidence intervals. There are two 
popular methods to obtain bootstrap-based confidence intervals: 


Percentile method. In this method, the empirical percentiles of the bootstrap 
distribution are used to get confidence intervals of the intended coverage for the 
parameter, say $\theta$. It applies to both parametric and nonparametric methods. This 
method is transformation-respecting, i.e., the percentile interval for any monotone 
transformation $y = h(\theta)$ is the percentile interval for $\theta$ mapped by $h(\theta)$. The 
confidence limits obtained by using this method are within the allowable range of 
the parameter. But it does not work well if the number of bootstrap samples is not 
sufficiently large or the sampling distribution is not symmetric. 


Bias corrected and accelerated method (BCa method). In this method, the 
percentile confidence limits are modified by taking into account the bias in the 
bootstrap sampling distribution and the tendency of the standard error to vary with 
$\hat{\theta}$. The value for bias correction is obtained by using the estimates from the 
bootstrap samples and a measure of acceleration is obtained by using jackknife 
estimates. Confidence intervals obtained by using the BCa method are 
transformation-respecting, range preserving, and have high accuracy. 


For more details refer to Efron and Tibshirani (1993). 
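
The two interval methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SYSTAT's implementation: the helper names (quantile, bootstrap_replicates, percentile_interval, bca_interval), the nearest-rank quantile rule, and the simulated data are all hypothetical, and the bias-correction and acceleration formulas are the textbook ones from Efron and Tibshirani (1993).

```python
import random
from statistics import NormalDist, mean


def quantile(sorted_vals, q):
    """Value at quantile q (0..1) of an already sorted list (nearest rank)."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def bootstrap_replicates(data, stat, n_samples, rng):
    """Recompute stat on n_samples resamples drawn with replacement."""
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_samples)]


def percentile_interval(boot, conf=0.95):
    """Percentile method: empirical quantiles of the bootstrap replicates."""
    s = sorted(boot)
    alpha = (1.0 - conf) / 2.0
    return quantile(s, alpha), quantile(s, 1.0 - alpha)


def bca_interval(data, stat, boot, conf=0.95):
    """BCa method: percentile limits shifted by bias (z0) and acceleration (a)."""
    nd = NormalDist()
    b = len(boot)
    theta_hat = stat(data)
    # Bias correction: share of replicates below the original estimate.
    below = sum(1 for v in boot if v < theta_hat)
    prop = min(max(below / b, 1.0 / (b + 1)), b / (b + 1.0))
    z0 = nd.inv_cdf(prop)
    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(len(data))]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0
    alpha = (1.0 - conf) / 2.0

    def adjusted(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    s = sorted(boot)
    return quantile(s, adjusted(alpha)), quantile(s, adjusted(1.0 - alpha))


rng = random.Random(12345)
sample = [rng.expovariate(1.0) for _ in range(30)]  # skewed illustrative data
reps = bootstrap_replicates(sample, mean, 2000, rng)
pct_lo, pct_hi = percentile_interval(reps)
bca_lo, bca_hi = bca_interval(sample, mean, reps)
```

Both intervals are read off the sorted replicates, so their limits cannot leave the range of values the statistic actually took; the BCa version only changes which percentiles are read.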
Resampling in SYSTAT 


Resampling Tab 


Resampling appears as a tab in the dialog boxes of all modules where this feature is 
available. For example, in the Analysis of Variance feature, the Resampling tab appears 
as follows: 


[Dialog box: Analysis of Variance: Estimate Model, Resampling tab, with the Perform resampling check box and the Method, Number of samples, Sample size, and Random seed settings] 


Perform resampling. Generates samples of cases and uses data thereof to carry out the 
same analysis on each sample. 

Method. Three resampling methods are available: 

■ Bootstrap. Generates bootstrap samples. This is the default method. 

■ Without replacement. Generates subsamples without replacement. 

■ Jackknife. Generates jackknife samples. 


Number of samples. Specify the number of samples to be generated. These samples are 
analyzed using the chosen method of sampling. The default is 1. 


Sample size. Specify the size of each sample to be generated while resampling. The 
default sample size is the number of cases in the data file in use. 


Random seed. Specify a random seed to be used while resampling. The default random 
seed is generated by the system. 


SYSTAT gives a summary based on resampling in Descriptive Statistics, Simple 
Correlations, and Least-Squares Regression. For further details, see the respective 
chapters. 


Using Commands 


The syntax is: 


ESTIMATE / SAMPLE=BOOT(m,n) or, 
                  SIMPLE(m,n) or, 
                  JACK 


The arguments m and n stand for the number of samples and the sample size of each 
sample. The parameter n is optional and defaults to the number of cases in the file. 


The BOOT option generates samples with replacement, SIMPLE generates samples 
without replacement, and JACK generates a jackknife set. 
For additional commands in Basic Statistics, CORR and REGRESS modules, see the 
specific chapter. 
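
To make the three sampling schemes concrete, here is a minimal Python sketch of how case indices could be drawn for each. It is not SYSTAT code; the function names are hypothetical, and it illustrates only the distinction between sampling with replacement, sampling without replacement, and leave-one-out sets.

```python
import random


def boot_indices(n_cases, size, rng):
    """BOOT-style draw: case indices sampled with replacement."""
    return [rng.randrange(n_cases) for _ in range(size)]


def simple_indices(n_cases, size, rng):
    """SIMPLE-style draw: a subsample of distinct cases (no replacement)."""
    return rng.sample(range(n_cases), size)


def jack_indices(n_cases):
    """JACK-style set: one sample per case, each leaving that case out."""
    return [[j for j in range(n_cases) if j != i] for i in range(n_cases)]


rng = random.Random(2024)            # fixed seed so the draws can be replicated
boot = [boot_indices(16, 16, rng) for _ in range(3)]
sub = simple_indices(16, 10, rng)
jack = jack_indices(16)
```

A jackknife set needs no sample count or size: a file of n cases always yields n samples of n - 1 cases each, which is consistent with JACK taking no arguments in the syntax above.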


Usage Considerations 


Types of data. Resampling works on procedures with rectangular data only. It can be 
performed when case selection is not in effect. 


Print options. It is best to set PLENGTH NONE; otherwise, you will get 16 miles of 
output. If you want to watch, however, set PLENGTH LONG and have some fun. 


Quick Graphs. Resampling produces Quick Graphs for modules where summarization 
based on resampling is provided. The histograms of different resampled statistics are 
plotted as Quick Graphs. In 'Descriptive Statistics', the selected statistics among Mean, 
Median, SD, Variance, Skewness, and Kurtosis are plotted. In 'Simple Correlations', 
the correlation coefficients are plotted. In 'Least-Squares Regression', the regression 
coefficients are plotted. 


Saving files. If you are doing this for more than entertainment (watching output fly by), 
save your data into a file before you use the ESTIMATE / SAMPLE command. See the 
examples. 


BY groups. By groups analysis is not available in resampling. 
Case frequencies. The analysis ignores any frequency variable specifications. 


Case weights. Use case weighting if it is available in a specific module. 


Examples 


A few examples will serve to illustrate resampling. They cover only a few of the 
statistical modules, however. We will focus on the tools you can use to manipulate 
output and get the summary statistics you need for resampling estimates. 


Example 1 
Least-Squares Regression 


This example involves the LONGLEY regression data. These real data were collected 
by James Longley (1967) at the Bureau of Labor Statistics to test the limits of 
regression software. The predictor variables in the data set are highly collinear, and 
several coefficients of variation are extremely large. 


The input is: 


REGRESS 
USE LONGLEY 
SAMPLE BOOT(2500,16) / CONFI=0.95 
MODEL TOTAL = CONSTANT + DEFLATOR + GNP + UNEMPLOY, 
      + ARMFORCE + POPULATN + TIME 
ESTIMATE 


Notice that we request bootstrap analysis with 2500 samples, each of size 16 (the number 
of cases in the file). We have also opted for a 95% confidence interval for the regression 
coefficients. 
The output is: 

Analysis for the Given Data 

Dependent Variable          | TOTAL 
N                           | 16 
Multiple R                  | 0.998 
Squared Multiple R          | 0.995 
Adjusted Squared Multiple R | 0.992 
Standard Error of Estimate  | 304.854 


Regression Coefficients B = (X'X)^-1 X'Y 

                                                Std. 
Effect   |   Coefficient  Standard Error  Coefficient  Tolerance       t 
---------+--------------------------------------------------------------- 
CONSTANT |  -3482258.635      890420.384        0.000          .  -3.911 
DEFLATOR |        15.062          84.915        0.046      0.007 
GNP      |        -0.036           0.033       -1.014      0.001 
UNEMPLOY |        -2.020           0.488       -0.538      0.030 
ARMFORCE |        -1.033           0.214       -0.205      0.279 
POPULATN |        -0.051           0.226       -0.101      0.003 
TIME     |      1829.151         455.478        2.480      0.001 


Regression Coefficients B = (X'X)^-1 X'Y (contd...) 

Effect   | p-value 
---------+-------- 
CONSTANT |   0.004 
DEFLATOR |   0.863 
GNP      |   0.313 
UNEMPLOY |   0.003 
ARMFORCE |   0.001 
POPULATN |   0.826 
TIME     |   0.003 


Analysis of Variance 

Source     |         SS  df  Mean Squares  F-ratio  p-value 
-----------+------------------------------------------------ 
Regression | 1.842E+008   6  30695400.324  330.285    0.000 
Residual   | 836424.056   9     92936.006 


Bootstrap Summary 

Number of Samples   | 2500 
Size of Each Sample | 16 


Bootstrap Estimates of the Regression Coefficients, Bias and Standard Error 

         |    Bootstrap 
Effect   |     Estimate         Bias  Standard Error 
---------+------------------------------------------ 
CONSTANT | -3815060.832  -332802.197     1810921.080 
DEFLATOR |       13.834       -1.228         442.345 
GNP      |       -0.046       -0.010           0.165 
UNEMPLOY |       -2.169       -0.149           2.468 
ARMFORCE |       -1.081       -0.048           0.834 
POPULATN |        1.200        1.251          59.624 
TIME     |     1983.411      154.260        1189.496 


95.0% Confidence Interval for the Regression Coefficients 

Effect   | Percentile Method   BCa Method 
         |             Upper        Upper 
---------+-------------------------------- 
CONSTANT |       -1020334.70   223930.833 
DEFLATOR |           308.644      350.212 
GNP      |             0.036        0.096 
UNEMPLOY |            -1.040       -0.093 
ARMFORCE |            -0.524       -0.192 
POPULATN |             0.817        0.488 
TIME     |          3979.198     3299.108 

Histogram of the Estimates of Coefficient 
The bootstrapped standard errors are all larger than the normal-theory standard errors. 
It is well known that multicollinearity leads to large standard errors for regression 
coefficients, but the bootstrap makes this even clearer. 

Notice that all the intervals are asymmetric. From these we get an idea that the 
coefficients are not normally distributed. 

From the histograms it is clear that the coefficients are not normally distributed. We 
have run a relatively large number of samples (2500) to reveal these long-tailed 
distributions. Were these data to be analyzed formally, it would take a huge number of 
samples to get useful standard errors. 

Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this 
problem. They added a uniform random extra digit to Longley's data so that their data 
sets rounded to Longley's values and found in a simulation that the variance of the 
simulated coefficient estimates was larger in many cases than the miscalculated 
solutions from the more poorly designed regression programs. 


Example 2 
Spearman Rank Correlation 


This example involves LAW school data from Efron and Tibshirani (1993). They use 
these data to illustrate the usefulness of the bootstrap for calculating standard errors on 
the Pearson correlation estimates. 

Here we use the bootstrap to find the standard error and a 95% confidence interval of 
the Spearman correlation. 


The input is: 


CORR 
USE LAW 
SAMPLE BOOT (1000,15) /CONFI=0.95 
SPEARMAN LSAT GPA 


The output is: 

Analysis for the Given Data 

Number of Observations: 15 


Spearman Correlation Matrix 

LSAT | 1.000 
GPA  | 0.796 1.000 


Bootstrap Summary 

Number of Samples   | 1000 
Size of Each Sample | 15 


Bootstrap Estimates of Spearman Correlation 

LSAT | 1.000 
GPA  | 0.766 1.000 


Matrix of Bias in Spearman Correlation 

LSAT |  0.000 
GPA  | -0.030 0.000 


Matrix of Standard Error of Spearman Correlation 

LSAT | 0.000 
GPA  | 0.125 0.000 


95.0% Confidence Interval for Spearman Correlation 

Variables  Variables | Percentile Method    BCa Method 
                     | Lower     Upper      Lower     Upper 
---------------------+-------------------------------------- 
LSAT       GPA       | 0.458     0.960      0.479     0.964 

Histogram of the Estimates of Correlation Coefficient 
The histogram of the entire file shows the overall shape of the distribution. Notice its 
asymmetry. 
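
The same kind of calculation can be sketched outside SYSTAT. The Python below is an illustration under assumptions: the paired scores are made up (they are not the LAW data), the function names are hypothetical, and tied values receive average ranks, which matters because resampling with replacement duplicates cases.

```python
import math
import random
from statistics import mean, stdev


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_spearman(x, y, n_samples, rng):
    """Resample cases (pairs) with replacement and recompute the statistic."""
    n = len(x)
    reps = []
    while len(reps) < n_samples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) > 1 and len(set(by)) > 1:  # skip degenerate resamples
            reps.append(spearman(bx, by))
    return reps


# Hypothetical paired scores for 15 cases (illustrative only).
xs = [52, 61, 47, 55, 70, 58, 49, 68, 66, 60, 67, 54, 45, 53, 57]
ys = [3.4, 3.3, 2.8, 3.0, 3.5, 3.1, 2.9, 3.4, 3.3, 3.2, 3.1, 2.7, 2.6, 2.9, 3.0]
reps = bootstrap_spearman(xs, ys, 1000, random.Random(15))
boot_se = stdev(reps)                    # bootstrap standard error
boot_bias = mean(reps) - spearman(xs, ys)
```

Because a correlation is bounded at 1, replicates pile up near the upper limit when the sample correlation is high, which is one source of the asymmetry noted above.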
Example 3 


Confidence Intervals for Mean and Median 


In this example we use OURWORLD data, which consists of demographic data across 
different countries. 

We will use the CSTATISTICS command to compute summarized bootstrap 
estimates of the mean and median for the variable LIFE_EXP, which gives years of 
average life expectancy. 
The input is: 


USE OURWORLD 
EXIT 
SAMPLE BOOT (1000,57) / MEAN MEDIAN CONFI = 0.95 
CSTATISTICS LIFE_EXP 


The output is: 

Bootstrap Summary 

Number of Samples   | 1000 
Size of Each Sample | 57 


Estimate of Mean 

Variable | Bootstrap Estimate from Bias Standard Error 
         | Estimate  Original Data 

63.719 0.014 1.550 


95.0% Confidence Interval for Mean 

Variable | Percentile Method    BCa Method 
         | Lower     Upper      Lower     Upper 
---------+-------------------------------------- 
LIFE_EXP | 60.632    66.754     60.404    66.649 


Histogram of the Estimates of Mean 

From the above histogram, the distribution of the mean appears to be close to normal. 
Hence the bootstrap estimates are similar to the original estimates. 


Estimate of Median 


Variable | Bootstrap  Estimate from    Bias  Standard Error 
         | Estimate   Original Data 
---------+------------------------------------------------- 
LIFE_EXP |   68.668          70.000  -1.332           2.307 


95.0% Confidence Interval for Median 

Variable | Percentile Method    BCa Method 
         | Lower     Upper      Lower     Upper 
---------+-------------------------------------- 
LIFE_EXP | 63.000    71.000     63.000    71.000 


Histogram of the Estimates of Median 

It can be seen that medians from different samples are mainly concentrated at 70, which 
is the median of the original sample. 
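
The columns of these summary tables follow a simple recipe: the bootstrap estimate is the average of the resampled statistics, the bias is that average minus the estimate from the original data, and the standard error is the standard deviation of the resampled statistics. A minimal Python sketch, assuming made-up data and a hypothetical helper name (bootstrap_summary):

```python
import random
from statistics import mean, median, stdev


def bootstrap_summary(data, stat, n_samples, rng):
    """Bootstrap estimate, bias, and standard error of one statistic."""
    n = len(data)
    original = stat(data)
    reps = [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_samples)]
    estimate = mean(reps)
    return {
        "bootstrap_estimate": estimate,
        "original": original,
        "bias": estimate - original,     # bootstrap estimate minus original
        "standard_error": stdev(reps),   # spread of the resampled statistics
    }


rng = random.Random(57)
values = [rng.gauss(64.0, 10.0) for _ in range(57)]  # hypothetical data
mean_summary = bootstrap_summary(values, mean, 1000, rng)
median_summary = bootstrap_summary(values, median, 1000, rng)

# The median table above obeys the same identity: 68.668 - 70.000 = -1.332.
table_bias = round(68.668 - 70.000, 3)
```

The resampled median can only take values that occur in the data (or midpoints of neighbouring values), which is why its histogram is lumpy and concentrated on a few values.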


Example 4 
Usefulness of Jackknife estimate 


Jackknife estimates are often used to estimate and reduce bias. In many cases 
maximum likelihood estimates are biased. Also, in many cases, there does not exist any 
unbiased estimate of the parameter of interest. 

For example, if X follows an exponential distribution with mean λ, then the maximum 
likelihood estimate of θ = 1/λ is given by θ̂ = 1/x̄, where x̄ is the sample mean. But 
this is a biased estimator. 


In the following example, we illustrate the use of the jackknife method to estimate 
θ for the above case. 


For this, we draw a random sample of size 50 from the exponential distribution with 
λ = 0.4, and save it in a file called USEFULNESS. 


RANDSAMP 
SAVE USEFULNESS 
UNIVARIATE ERN(0, 0.4) / SIZE = 50 NSAMP = 1 
EXIT 


Note that the population value of θ is 2.5. 


Now we use the jackknife method to find the mean of the data set and save the resampled 
means in a data file called JACK1. 


The input is: 


USE USEFULNESS 
SSAVE JACK1 / AG 
SAMPLE JACK / MEAN 
CSTATISTICS S1 


The output is: 

Jackknife Summary 

Number of Samples   | 50 
Size of Each Sample | 49 


Estimate of Mean 

Variable | Jackknife Estimate from Bias Standard Error 
         | Estimate  Original Data 


Histogram of the Estimates of Mean 

Note that the estimate of bias is zero. But this is for λ, not for θ. The estimate of 
θ from the data set USEFULNESS is 2.475. 


Now we find the estimate of θ from each jackknife sample by recalling the data set 
JACK1. 


The input is: 

USE JACK1 
DEFVAR THETA / Type = Number, Display = 12.3 
LET THETA = 1/S1 

Then we find the mean of the column THETA with its standard error. 

CSTATISTICS THETA / MEAN SD 


The output is: 


Arithmetic Mean    | 2.474 
Standard Deviation | 0.057 


Note that both the original and jackknife estimates are close. Also, by using the 
jackknife method, we obtain a standard error of the estimate, which is difficult to get 
otherwise. 
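
For readers who want to see the arithmetic, here is a Python sketch of the textbook jackknife for θ = 1/λ. It is an illustration, not SYSTAT's code: the function names are hypothetical and the data are simulated, so the numbers will not match the output above. Under the textbook formula, the jackknife standard error is the standard deviation of the leave-one-out estimates scaled by (n - 1)/√n, so that standard deviation is not itself the standard error.

```python
import math
import random
from statistics import mean, stdev


def jackknife(data, stat):
    """Leave-one-out estimates with the usual bias and standard error formulas."""
    n = len(data)
    full = stat(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    rbar = mean(reps)
    bias = (n - 1) * (rbar - full)
    se = math.sqrt((n - 1) / n * sum((r - rbar) ** 2 for r in reps))
    return {"estimate": full, "bias": bias, "corrected": full - bias,
            "se": se, "replicates": reps}


def rate(xs):
    """Maximum likelihood estimate of theta = 1/lambda: one over the sample mean."""
    return 1.0 / mean(xs)


rng = random.Random(50)
# expovariate(2.5) has mean 1/2.5 = 0.4, so the population theta is 2.5.
xs = [rng.expovariate(2.5) for _ in range(50)]
mean_jack = jackknife(xs, mean)     # bias estimate is zero for the mean
theta_jack = jackknife(xs, rate)    # bias estimate is positive for 1/mean
```

For the sample mean, the jackknife bias estimate is exactly zero and the jackknife standard error reduces to s/√n; the interesting case is the nonlinear statistic 1/x̄, for which the bias estimate is positive.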
Example 5 
Canonical Correlations: Using Text Output 


Most statistics can be bootstrapped by saving into SYSTAT files, as shown in the 
examples. Sometimes you may want to search through bootstrap output for a single 
number and compute standard errors or graphs for that statistic. The following example 
uses SETCOR to compute the distribution of the two canonical correlations relating the 
species to measurements in the Fisher IRIS data. The same correlations are computed 
in the DISCRIM procedure. 


The input is: 


SETCOR 
PLENGTH LONG 
USE IRIS 
MODEL SPECIES=SEPALLEN..PETALWID 
CATEGORY SPECIES 
PUSH CLASSIC 
CLASSIC ON 
OUTPUT '&OUTPUT\TEMP' 
ESTIMATE / SAMPLE=BOOT (500,150) 
OUTPUT 
POP CLASSIC 


GET '&OUTPUT\TEMP' 
INPUT A$,B$,C$,D$ 


LET R1=. 
LET R2=. 
LET FOUND=. 
IF A$="Canonical" AND B$="Correlations" THEN LET FOUND=CASE 
IF LAG(FOUND)=. THEN DELETE 


LET R1=VAL(B$) 
LET R2=VAL(C$) 


DSAVE CC 


USE CC 
DROP A$ 
DENSITY R1 R2 / DIT 


Notice how the program searches through the output file TEMP.DAT for the words 
Canonical Correlations at the beginning of a line. In the next line, the actual numbers 
are in the output, so we use the LAG function to check when we are at that point after 
having located the string. Then we convert the printed values back to numbers with the 
VAL() function. If you are concerned with precision, use a larger format for the output. 
We delete unwanted rows and save the results into the file CC. From that file, we plot 
the two canonical correlations. For fun, we do a dot histogram (dit) plot. 
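
The search-then-read-the-next-line idea is not specific to SYSTAT. Below is a minimal Python sketch of the same scan; the listing text, its layout, and the function name values_after_marker are made up for illustration and do not reproduce the actual SETCOR output format.

```python
def values_after_marker(lines, marker=("Canonical", "Correlations"), n_values=2):
    """Collect the numbers printed on the first non-blank line after each marker line."""
    rows = []
    pending = False
    for line in lines:
        tokens = line.split()
        if pending and tokens:
            pending = False
            try:
                row = tuple(float(t) for t in tokens[:n_values])
            except ValueError:
                continue                 # the line did not hold numbers
            if len(row) == n_values:
                rows.append(row)
        elif tuple(tokens[:len(marker)]) == marker:
            pending = True               # the numbers follow on a later line
    return rows


# Hypothetical excerpt of a text listing holding two bootstrap replications.
listing = """\
Canonical Correlations
   0.985   0.471
Some other output line
Canonical Correlations

   0.979   0.502
""".splitlines()

pairs = values_after_marker(listing)
```

As in the SYSTAT version, the recovered values are only as precise as the printed format, so rounding in the listing carries over to the resampled distribution.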
The output is: 


[Dot histograms of the bootstrapped canonical correlations R1 and R2] 


Notice the stripes in the plot on the left. These reveal the three-digit rounding we 
incurred by using the standard FORMAT 3. 


Example 6 
POSAC: Proportion of Profile Pairs Correctly Represented 


This bootstrap example corresponds to Multiple Categories on page 390 in the POSAC 
chapter (Example 3: Chapter 10 in Statistics III). Here POSAC uses the crime data to 
construct a 2D solution of crime patterns. We first recode the data into four categories 
for each item by using the CUT function. The cuts are made at each standard deviation 
and the mean. Then, POSAC computes the coordinates for these four category profiles. 
The main objective of this bootstrap example is to study the distribution of the 
proportion of profile pairs correctly represented. Here we use 1000 resamples of size 
50 (sample size in original data) each and find the 95% confidence interval and plot the 
histogram for this proportion of profile pairs correctly represented. 
The input is: 


USE CRIME 
STANDARDIZE MURDER..AUTOTHFT 
LET (MURDER..AUTOTHFT) = CUT(@,-1,0,1,4) 
POSAC 
MODEL MURDER..AUTOTHFT 
PUSH CLASSIC 
CLASSIC ON 
OUTPUT '&OUTPUT\TEMP' 
ESTIMATE / SAMPLE=BOOT (1000, 50) 
OUTPUT 
POP CLASSIC 


GET '&OUTPUT\TEMP' 
INPUT A$ B$ C$ D$ E$ F$ G$ H$ \ 
IF A$ <> "Proportion" THEN DELETE 
LET CORRECT=. 
LET CORRECT=VAL (H$) 
DSAVE MD 


USE MD 
SORT CORRECT 
IF CASE=25 THEN PRINT "95.0% Lower Confidence Limit: ",CORRECT 
IF CASE=975 THEN PRINT "95.0% Upper Confidence Limit: ",CORRECT 
DENSITY CORRECT 
The output is: 

95.0% Lower Confidence Limit: 0.729 
95.0% Upper Confidence Limit: 0.905 

The following is the histogram of the bootstrap sample proportion of profile pairs 
correctly represented: 


Example 7 


Nonparametric: One Sample Kolmogorov-Smirnov Test Statistic 


The file MINTEMP contains the annual minimum temperature (F) of Plymouth (in 
Britain) for 49 years (1916-1964). Barnett and Lewis (1967) fitted a Gumbel 
distribution to the data. Estimates of location and scale parameters are 23.374 and 
2.959 respectively. The Kolmogorov-Smirnov test statistic is 0.153 with a p-value 
(2-tail) of 0.200 (refer to example 4 of Fitting Distributions, chapter 13 in Statistics I). The 
main objective of this bootstrap example is to obtain an approximation to the sampling 
distribution of the statistic. Here we use 1000 resamples of size 49 and find the p-value 
for this observed statistic (0.153). For this example, we first compute the test statistic 
for the bootstrap samples from the distribution relevant to the null hypothesis (Gumbel 
with parameters 23.374, 2.959); then we compute the p-value as the proportion of test 
statistic values greater than or equal to the observed value of the test statistic. For more 
details, refer to Davison and Hinkley (1997). 
The input is: 


USE 
RSEED 7298 
REPEAT 49 
LET GUMBEL = GURN(23.374,2.959) 


NPAR 
PUSH CLASSIC 
CLASSIC ON 
OUTPUT '&OUTPUT\TEXT' 
KS GUMBEL / GUM=23.374,2.959 SAMPLE=BOOT (1000,49) 
OUTPUT 
POP CLASSIC 


GET '&OUTPUT\TEXT' 
INPUT A$ VAR$ B$ M MAXDIF 
SELECT (VAR$ = 'GUMBEL') 
HIST MAXDIF / BARS=25 
IF (MAXDIF>=0.153) THEN LET P_VALUE=1/1000 
CSTATISTICS P_VALUE / SUM 


The output is: 


P_VALUE 

The following is the histogram of the bootstrap sample K-S test statistics: 

The p-value from this histogram is 0.313. From the K-S test the p-value is 0.200. 
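
The p-value calculation itself, the proportion of simulated statistics at least as large as the observed one, can be sketched in Python. This sketch rests on assumptions and is not expected to reproduce 0.313: it draws every sample directly from the null distribution instead of resampling one generated sample as the SYSTAT input does, it uses the maximum form of the Gumbel distribution, and the function names are hypothetical.

```python
import math
import random


def gumbel_cdf(x, loc, scale):
    """CDF of the Gumbel (maximum) distribution; the choice of form is an assumption."""
    return math.exp(-math.exp(-(x - loc) / scale))


def gumbel_draw(loc, scale, rng):
    """Inverse-CDF draw from the same distribution."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0)
        u = rng.random()
    return loc - scale * math.log(-math.log(u))


def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a fully specified CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def simulated_p_value(observed, n, n_samples, loc, scale, rng):
    """Share of null-distribution samples whose statistic is >= observed."""
    def cdf(x):
        return gumbel_cdf(x, loc, scale)

    hits = 0
    for _ in range(n_samples):
        sample = [gumbel_draw(loc, scale, rng) for _ in range(n)]
        if ks_statistic(sample, cdf) >= observed:
            hits += 1
    return hits / n_samples


rng = random.Random(7298)
p_value = simulated_p_value(0.153, 49, 1000, 23.374, 2.959, rng)
```

Counting hits and dividing by the number of samples is the same computation as summing 1/1000 over the qualifying cases in the SYSTAT input.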


Computation 


Computations are done by the respective statistical modules. Sampling is done on the 
data. 


Algorithms 


Bootstrapping and other sampling is implemented via a one-pass algorithm that does 
not use extra storage for the data. Samples are generated using the SYSTAT uniform 
random number generator. It is always a good idea to reset the seed when running a 
problem so that you can be certain where the random number generator started if it 
becomes necessary to replicate your results. 


Missing Data 


Cases with missing data are handled by the specific modules. 
Chapter 3 
Classification and Regression Trees 


Leland Wilkinson 


The TREES module computes classification and regression trees. Classification trees 
include those models in which the dependent variable (the predicted variable) is 
categorical. Regression trees include those in which it is continuous. Within these 
types of trees, the TREES module can use categorical or continuous predictors, 
depending on whether a CATEGORY statement includes some or all of the predictors. 

For any of the models, a variety of loss functions is available. Each loss function 
is expressed in terms of a goodness-of-fit statistic, the proportion of reduction in 
error (PRE). For regression trees, this statistic is equivalent to the multiple R². Other 
loss functions include the Gini index, “twoing” (Breiman et al., 1984), and the phi 
coefficient. 

TREES produces graphical trees called mobiles (Wilkinson, 1995). At the end of 
each branch is a density display (box plot, dot plot, histogram, etc.) showing the 
distribution of observations at that point. The branches balance (like a Calder mobile) 
at each node so that the branch is level, given the number of observations at each end. 
The physical analogy is most obvious for dot plots, in which the stacks of dots (one 
for each observation) balance like marbles in bins. 

TREES can also produce a SYSTAT program to code new observations and predict 
the dependent variable. This program can be saved to a file and run from the command 
window or submitted as a program file. 

Resampling procedures are available in this feature. 
Statistical Background 


Trees are directed graphs beginning with one node and branching to many. They are 
fundamental to computer science (data structures), biology (classification), 
psychology (decision theory), and many other fields. Classification and regression 
trees are used for prediction. In the last two decades, they have become popular as 
alternatives to regression, discriminant analysis, and other procedures based on 
algebraic models. Tree-fitting methods have become so popular that several 
commercial programs now compete for the attention of market researchers and others 
looking for software. 

Different commercial programs produce different results with the same data, 
however. Worse, some programs provide no documentation or supporting material to 
explain their algorithms. The result is a marketplace of competing claims, jargon, and 
misrepresentation. Reviews of these packages (for example, Levine, 1991; Simon, 
1991) use words like “sorcerer,” “magic formula,” and “wizardry” to describe the 
algorithms and express frustration at vendors’ scant documentation. Some vendors, in 
turn, have represented tree programs as state-of-the-art “artificial intelligence” 
procedures capable of discovering hidden relationships and structures in databases. 

Despite the marketing hyperbole, most of the now-popular tree-fitting algorithms 
have been around for decades. The modern commercial packages are mainly 
microcomputer ports (with attractive interfaces) of the mainframe programs that 
originally implemented these algorithms. Warnings of abuse of these techniques are 
not new either (for example, Einhorn, 1972; Bishop et al., 1975). Originally proposed 
as automatic procedures for detecting interactions among variables, tree-fitting 
methods are actually closely related to classical cluster analysis (Hartigan, 1975). 

This introduction will attempt to sort out some of the differences between 
algorithms and illustrate their use on real data. In addition, tree analyses will be 
compared to discriminant analysis and regression. 


The Basic Tree Model 


The figure below shows a tree for predicting decisions by a medical school admissions 
committee (Milstein et al., 1975). It was based on data for a sample of 727 applicants. 
We selected a tree procedure for this analysis because it was easy to present the results 
to the Yale Medical School admissions committee and because the tree model could 
serve as a basis for structuring their discussions about admissions policy. 
Notice that the values of the predicted variable (the committee’s decision to reject 
or interview) are at the bottom of the tree and the predictors (Medical College 
Admissions Test and college grade point average) come into the system at each node 
of the tree. 

The top node contains the entire sample. Each remaining node contains a subset of 
the sample in the node directly above it. Furthermore, each node contains the sum of 
the samples in the nodes connected to and directly below it. The tree thus splits 
samples. 

Each node can be thought of as a cluster of objects, or cases, that is to be split by 
further branches in the tree. The numbers in parentheses below the terminal nodes 
show how many cases are incorrectly classified by the tree. A similar tree data structure 
is used for representing the results of single and complete linkage and other forms of 
hierarchical cluster analysis (Hartigan, 1975). Tree prediction models add two 
ingredients: the predictor and predicted variables labeling the nodes and branches. 
The tree is binary because each node is split into only two subsamples. Classification 
or regression trees do not have to be binary, but most are. Despite the marketing claims 
of some vendors, nonbinary, or multibranch, trees are not superior to binary trees. Each 
is a permutation of the other, as shown in the figure below. 

The tree on the left (ternary) is not more parsimonious than that on the right (binary). 
Both trees have the same number of parameters, or split points, and any statistics 
associated with the tree on the left can be converted trivially to fit the one on the right. 
A computer program for scoring either tree (IF ... THEN ... ELSE) would look identical. 
For display purposes, it is often convenient to collapse binary trees into multibranch 
trees, but this is not necessary. 

Some programs that do multibranch splits do not allow further splitting on a predictor 
once it has been used. This has an appealing simplicity. However, it can lead to 
unparsimonious trees. It is unnecessary to make this restriction before fitting a tree. 

The figure below shows an example of this problem. The upper right tree classifies 
objects on an attribute by splitting once on shape, once on fill, and again on shape. This 
allows the algorithm to separate the objects into only four terminal nodes having 
common values. The upper left tree splits on shape and then only on fill. By not 
allowing any other splits on shape, the tree requires five terminal nodes to classify 
correctly. This problem cannot be solved by splitting first on fill, as the lower left tree 
shows. In general, restricting splits to only one branch for each predictor results in 
more terminal nodes. 


Categorical or Quantitative Predictors 


The predictor variables in the figure on page 43 are quantitative, so splits are created 
by determining cut points on a scale. If predictor variables are categorical, as in the 
figure above, splits are made between categorical values. It is not necessary to 
categorize predictors before computing trees. This is as dubious a practice as recoding 
data well-suited for regression into categories in order to use chi-square tests. Those 
who recommend this practice are turning silk purses into sows’ ears. In fact, if 
variables are categorized before doing tree computations, then poorer fits are likely to 
result. Algorithms are available for mixed quantitative and categorical predictors, 
analogous to analysis of covariance. 


Regression Trees 


Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a 
quantitative variable. They called the method Automatic Interaction Detection 
(AID). The algorithm performs stepwise splitting. It begins with a single cluster of 
cases and searches a candidate set of predictor variables for a way to split the cluster 
into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on 
the predictor and examine all n - 1 ways to split the cluster in two. For each possible 
split, compute the within-cluster sum of squares about the mean of the cluster on the 
dependent variable. Choose the best of the n - 1 splits to represent the predictor's 
contribution. Now do this for every other predictor. For the actual split, choose the 
predictor and its cut point that yield the smallest overall within-cluster sum of squares. 
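
The split search just described can be sketched as follows. This is a direct, unoptimized Python illustration with hypothetical function names and made-up data, not the TREES implementation; it places each cut point midway between neighbouring predictor values, which is one common convention and an assumption here.

```python
from statistics import mean


def within_ss(values):
    """Sum of squares of values about their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_cut(x, y):
    """Try the n - 1 splits of the cases sorted on x; return (ss, cut) or None."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                     # no cut point between tied x values
        ss = (within_ss([p[1] for p in pairs[:i]])
              + within_ss([p[1] for p in pairs[i:]]))
        if best is None or ss < best[0]:
            best = (ss, (pairs[i - 1][0] + pairs[i][0]) / 2.0)
    return best


def best_split(predictors, y):
    """Pick the predictor and cut with the smallest within-cluster sum of squares."""
    candidates = []
    for name, x in predictors.items():
        cut = best_cut(x, y)
        if cut is not None:
            candidates.append((cut[0], name, cut[1]))
    return min(candidates) if candidates else None


# Hypothetical data: the response jumps when "dose" passes 3.
predictors = {"dose": [1, 2, 3, 10, 11, 12], "age": [30, 50, 20, 40, 60, 10]}
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
split = best_split(predictors, y)        # (sum of squares, predictor, cut point)
```

Applying the same search recursively to each resulting cluster grows the tree one split at a time.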

Categorical predictors require a different approach. Since categories are unordered, 
all possible splits between categories must be considered. For deciding on one split of 
k categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered. 
Once a split is found, its suitability is measured on the same within-cluster sum of 
squares as for a quantitative predictor. 
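
The count of 2^(k-1) - 1 can be checked by enumeration. In the Python sketch below (the function name is hypothetical), the first category is pinned to the left group so that each split is generated once and not again as its mirror image.

```python
from itertools import combinations


def binary_splits(categories):
    """All ways to divide the categories into two non-empty groups."""
    cats = list(categories)
    if len(cats) < 2:
        return []
    first, rest = cats[0], cats[1:]
    splits = []
    # r = how many of the remaining categories join the first one;
    # r = len(rest) is excluded because the right group would be empty.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = (first,) + combo
            right = tuple(c for c in rest if c not in combo)
            splits.append((left, right))
    return splits


three = binary_splits(["a", "b", "c"])
four = binary_splits(["a", "b", "c", "d"])
```

The count roughly doubles with every added category, which is why exhaustive search over categorical splits becomes expensive for predictors with many levels.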

Morgan and Sonquist called their algorithm AID because it naturally incorporates 
interaction among predictors. Interaction is not correlation. It has to do, instead, with 
conditional discrepancies. In the analysis of variance, interaction means that a trend 
within one level of a variable is not parallel to a trend within another level of the same 
variable. In the ANOVA model, interaction is represented by cross-products between 
predictors. In the tree model, it is represented by branches from the same node that 
have different splitting predictors further down the tree. 
The figure below shows a tree without interactions on the left and with interactions 
on the right. Because interaction trees are a natural by-product of the AID splitting 
algorithm, Morgan and Sonquist called the procedure “automatic.” In fact, AID trees 
without interactions are quite rare for real data, so the procedure is indeed automatic. 
To search for interactions using stepwise regression or ANOVA linear modeling, we 
would have to generate 2^p - p - 1 interactions among p predictors and compute partial 
correlations for every one of them in order to decide which ones to include in our 
formal model. 


[Figure: a tree without interactions (left) and a tree with interactions (right)] 


Classification Trees 


Regression trees are parallel to regression/ANOVA modeling, in which the dependent 
variable is quantitative. Classification trees are parallel to discriminant analysis and 
algebraic classification methods. Kass (1980) proposed a modification to AID called 
CHAID for categorized dependent and independent variables. His algorithm 
incorporated a sequential merge-and-split procedure based on a chi-square test 
statistic. Kass was concerned about computation time (although this has since proved 
an unnecessary worry), so he decided to settle for a suboptimal split on each predictor 
instead of searching for all possible combinations of the categories. Kass's algorithm 
is like sequential crosstabulation. For each predictor: 


■ Crosstabulate the m categories of the predictor with the k categories of the 
dependent variable. 

■ Find the pair of categories of the predictor whose 2 x k subtable is least 
significantly different on a chi-square test and merge these two categories. 

■ If the chi-square test statistic is not "significant" according to a preset critical value, 
repeat this merging process for the selected predictor until no nonsignificant 
chi-square is found for a subtable. 

■ Choose the predictor variable whose chi-square is the largest and split the sample 
into l ≤ m subsets, where l is the number of categories resulting from the merging 
process on that predictor. 

■ Continue splitting, as with AID, until no significant chi-squares result. 


The CHAID algorithm saves computer time, but it is not guaranteed to find the splits 
that predict best at a given step. Only by searching all possible category subsets can we 
do that. CHAID is also limited to categorical predictors, so it cannot be used for 
quantitative or mixed categorical-quantitative models, as in the figure on page 43. 
Nevertheless, it is an effective way to search heuristically through rather large tables 
quickly. 
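
One merging pass of the kind described in the steps above can be sketched in Python. This is an illustration with made-up counts and hypothetical function names, not Kass's program: it compares pairs by the raw chi-square statistic, which ranks them the same way as the significance level only when every 2 x k subtable has the same degrees of freedom, and it omits the critical-value test that decides whether to merge at all.

```python
from itertools import combinations


def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def least_different_pair(counts):
    """Pair of predictor categories whose 2 x k subtable has the smallest chi-square."""
    return min(combinations(sorted(counts), 2),
               key=lambda pair: chi_square([counts[pair[0]], counts[pair[1]]]))


def merge(counts, pair):
    """Combine two predictor categories by adding their rows of counts."""
    a, b = pair
    merged = {k: v for k, v in counts.items() if k not in pair}
    merged[a + "+" + b] = [x + y for x, y in zip(counts[a], counts[b])]
    return merged


# Hypothetical crosstab: predictor category -> counts in k = 2 outcome classes.
counts = {"low": [30, 10], "mid": [28, 12], "high": [8, 32]}
pair = least_different_pair(counts)
merged = merge(counts, pair)
```

Repeating the pass until the smallest chi-square exceeds the chosen critical value gives the reduced set of categories on which the predictor is finally split.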


Note: Within the computer science community, there is a categorical splitting literature 
that often does not cite the statistical work and is, in turn, not frequently cited by 
statisticians (although this has changed in recent years). Quinlan (1986, 1992), the best 
known of these researchers, developed a set of algorithms based on information theory. 
These methods, called ID3, iteratively build decision trees based on training samples 
of attributes. 


Stopping Rules, Pruning, and Cross-Validation 


AID, CHAID, and other forward-sequential tree-fitting methods share a problem with 
other tree-clustering methods—where do we stop? If we keep splitting, a tree will end 
up with only one case, or object, at each terminal node. We need a method for 
producing a smaller tree other than the exhaustive one. One way is to use stepwise 
statistical tests, as in the F-to-enter or alpha-to-enter rule for forward stepwise 
regression. We compute a test statistic (chi-square, F, etc.), choose a critical level for 
the test (sometimes modifying it with the Bonferroni inequality), and stop splitting any 
branch that fails to meet the test (see Wilkinson, 1979, for a review of this procedure in
forward selection regression).

Breiman et al. (1984) showed that this method tends to yield trees with too many 
branches and can also fail to pursue branches that can add significantly to the overall 
fit. They advocate, instead, pruning the tree. After computing an exhaustive tree, their 
program eliminates nodes that do not contribute to the overall prediction. They add 
another essential ingredient, however—the cost of complexity. This measure is similar 
to other cost statistics, such as Mallows' Cp (Neter et al., 1996), which add a penalty
for increasing the number of parameters in a model. Breiman's method is not like
backward elimination stepwise regression. It resembles forward stepwise regression
with a cutting back on the final number of steps using a different criterion than the
F-to-enter. This method still cannot do as well as an exhaustive search, which would be
prohibitive for most practical problems.
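
For reference, the cost-complexity measure of Breiman et al. (1984) is usually written as below. The symbols follow their notation and are not TREES options: R(T) is the error of tree T on the training sample, |T~| is its number of terminal nodes, and alpha is the penalty per terminal node.

```latex
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|, \qquad \alpha \ge 0
```

With alpha = 0 the exhaustive tree is preferred; as alpha grows, smaller subtrees minimize the penalized cost.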

Regardless of how a tree is pruned, it is important to cross-validate it. As with 
stepwise regression, the prediction error for a tree applied to a new sample can be 
considerably higher than for the training sample on which it was constructed. 
Whenever possible, data should be reserved for cross-validation. 


Loss Functions 


Different loss functions are appropriate for different forms of data. TREES offers a 
variety of functions that are scaled as proportional reduction in error (PRE) statistics. 
This allows you to try different loss functions on a problem and compare their 
predictive validity. 

For regression trees, the most appropriate loss functions are least-squares, trimmed 
mean, and least absolute deviations. Least-squares loss yields the classic AID tree. At 
each split, cases are classified so that the within-group sum of squares about the mean 
of the group is as small as possible. The trimmed mean loss works the same way but 
first trims 20% of outlying cases (10% at each extreme) in a splittable subset before 
computing the mean and sum of squares. It can be useful when you expect outliers in 
subgroups and don’t want them to influence the split decisions. LAD loss computes 
least absolute deviations about the mean rather than squares. It, too, gives less weight 
to extreme cases in each potential group. 
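
As a rough illustration of the least-squares case, the sketch below searches one quantitative predictor for the cut that minimizes the pooled within-group sum of squares and reports the resulting PRE. The function names and data are hypothetical, and this is not the TREES implementation.

```python
def sum_of_squares(values):
    """Sum of squared deviations about the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_ls_split(x, y, min_count=1):
    """Return (cut, pre) for the cut on x that minimizes the pooled
    within-group sum of squares of y. Cases with x < cut go left."""
    total_ss = sum_of_squares(y)
    pairs = sorted(zip(x, y))
    best_cut, best_within = None, total_ss
    for i in range(min_count, len(pairs) - min_count + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between tied predictor values
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        within = sum_of_squares(left) + sum_of_squares(right)
        if within < best_within:
            best_cut, best_within = pairs[i][0], within
    pre = 1 - best_within / total_ss
    return best_cut, pre

# Hypothetical data: y jumps when x reaches 4.
x = [1, 2, 3, 4, 5, 6]
y = [10, 11, 9, 20, 21, 19]
cut, pre = best_ls_split(x, y)
print(cut, round(pre, 3))  # 4 0.974
```

Replacing sum_of_squares with a trimmed or absolute-deviation criterion would give analogues of the other two regression losses.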

For classification trees, use the phi coefficient (the default), Gini index, or “twoing.” 
The phi coefficient is χ²/n for a 2 × k table formed by the split on k categories of the
dependent variable. The Gini index is a variance estimate based on all comparisons of 
possible pairs of values in a subgroup. Finally, twoing is a word coined by Breiman et 
al. (1984) to describe splitting k categories as if it were a two-category splitting 
problem. For more information about the effects of Gini and twoing on computations, 
see Breiman et al. (1984). 
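
The phi criterion as described above (chi-square divided by n for the 2 × k table of a candidate split) can be sketched as follows. The function name and the counts are illustrative assumptions.

```python
def phi_loss(left, right):
    """Chi-square divided by n for the 2 x k table with rows left and right."""
    table = [left, right]
    row_totals = [sum(row) for row in table]
    col_totals = [l + r for l, r in zip(left, right)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi_square += (observed - expected) ** 2 / expected
    return chi_square / n

# Hypothetical response with k = 3 categories and 50 cases per category.
clean = phi_loss([50, 0, 0], [0, 50, 50])        # one category fully separated
useless = phi_loss([10, 10, 10], [40, 40, 40])   # same mix on both sides
print(round(clean, 3), round(useless, 3))
```

A split that isolates one category completely scores 1, while a split that leaves the same category mix on both sides scores 0.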


Geometry 


Most discussions of trees versus other classifiers compare tree graphs and algebraic
equations. However, there is another graphic view of what a tree classifier does. If
we look at the cases embedded in the space of the predictor variables, we can ask how
a linear discriminant analysis partitions the cases and how a tree classifier partitions
them.

The figure below shows how cases are split by a linear discriminant analysis. There 
are three subgroups of cases in this example. The cutting planes are positioned 
approximately halfway between each pair of group centroids. Their orientation is 
determined by the discriminant analysis. With three predictors and four groups, there 
are six cutting planes, although only four planes show in the figure. The fourth group 
is assumed to be under the bottom plane in the figure. In general, if there are g groups,
the linear discriminant model cuts them with g(g-1)/2 planes.


The figure below shows how a tree-fitting algorithm cuts the same data. Only the 
nearest subgroup (dark spots) shows; the other three groups are hidden behind the rear 
and bottom cutting planes. Notice that the cutting planes are parallel to the axes. While 
this would seem to restrict the discrimination compared to the more flexible angles 
allowed by the discriminant planes, the tree model allows interactions between 
variables, which do not appear in the ordinary linear discriminant model. Notice, for 
example, that one plane splits on the X variable, but the second plane that splits on the 
Y variable cuts only the values to the left of the X partition. The tree model can continue 
to cut any of these subregions separately, unlike the discriminant model, which can cut 
only globally and with g(g-1)/2 planes. This is a mixed blessing, however, since tree
methods, as we have seen, can over-fit the data. It is critical to test them on new
samples.

Tree models are not usually related by authors to dimensional plots in this way, but
it is helpful to see that they have a geometric interpretation. Alternatively, we can
construct algebraic expressions for trees. They would require dummy variables for any
categorical predictors and interaction (or product) terms for every split whose
descendants (or lower nodes) did not involve the same variables on both sides.
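
To make the algebraic view concrete, the sketch below writes a two-split tree as a sum of products of 0/1 indicator variables. The cut values 3.0 and 1.8 are taken from the iris example later in this chapter; the function names and the coding of the prediction as 1, 2, or 3 are assumptions for illustration.

```python
def indicator(condition):
    """Dummy variable: 1 if the condition holds, otherwise 0."""
    return 1 if condition else 0

def tree_score(petallen, petalwid):
    """Predicted species code from a two-split tree written algebraically."""
    short = indicator(petallen < 3.0)    # first split
    narrow = indicator(petalwid < 1.8)   # second split, used only when short == 0
    # The products (1 - short) * narrow and (1 - short) * (1 - narrow) are the
    # interaction terms: the second split applies on one side of the first only.
    return 1 * short + 2 * (1 - short) * narrow + 3 * (1 - short) * (1 - narrow)

print(tree_score(1.4, 0.2), tree_score(4.5, 1.5), tree_score(6.0, 2.2))  # 1 2 3
```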


Classification and Regression Trees in SYSTAT


Classification and Regression Trees Dialog Box 


To open the Classification and Regression Trees dialog box, from the menus choose: 


Advanced 
Trees (C&RT)... 


[Figure: Advanced: Trees (C&RT) dialog box, Model tab]


Model selection and estimation are available in the Model tab of the Classification and 
Regression Trees dialog box: 


Dependent. The variable you want to examine. The dependent variable should be a
continuous or categorical numeric variable (for example, INCOME).


Independent(s). Select one or more continuous or categorical variables (grouping 
variables). 


Expand model. Adds all possible sums and differences of the predictors to the model. 


Loss. Select a loss function from the drop-down list. 


Least-squares. The least-squares loss (AID) minimizes the sum of the squared
deviations.


Trimmed mean. The trimmed mean loss (TRIM) “trims” the extreme observations 
(20%) prior to computing the mean. 


Least absolute deviations. The least absolute deviations loss (LAD). 


Phi coefficient. The phi coefficient loss computes the correlation between two 
dichotomous variables. 


Gini index. The Gini index loss measures inequality or dispersion.


Twoing. The twoing loss function. 


Display nodes as. Select the type of density display. The following types are available: 


Box plot. Plot that uses boxes to show a distribution shape, central tendency, and 
variability. 

Dot histogram (Dit). Produces a density display that looks similar to a histogram. 
Unlike histograms, dot histograms represent every observation with a unique 
symbol, so they are especially suited for small- to moderate-size samples of 
continuous data. 


Symmetrical dot density (Dot). Plot that displays dots at the exact locations of data 
values. 

Jittered dot density. Plot that calculates the exact locations of the data values, but
jitters points randomly on a short vertical axis to keep points from colliding.

Density stripes. Places vertical lines at the location of data values along a
horizontal data scale and looks like supermarket bar codes.

Text. Displays text output in the tree diagram including the mode, sample size, and
impurity value.


Stopping Criteria


The Stopping Criteria tab contains the parameters for controlling stopping. 


[Figure: Advanced: Trees (C&RT) dialog box, Stopping Criteria tab, with fields for Number of splits, Minimum proportion, Split minimum, and Minimum objects at end of trees]


Specify the criteria for splitting to stop. 


Number of splits. Maximum number of splits. 


Minimum proportion. Minimum proportion reduction in error (PRE) for the tree
allowed at any split.


Split minimum. Minimum split value allowed at any node. 


Minimum objects at end of trees. Minimum count allowed at any node.


Using Commands


After selecting a file with USE FILENAME, continue with: 


TREES
   MODEL yvar = xvarlist / EXPAND
   ESTIMATE / PMIN=d, SMIN=d, NMIN=n, NSPLIT=n, LOSS=LSQ
              TRIM LAD PHI GINI TWOING, DENSITY=STRIPE JITTER DOT DIT BOX
              SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


Usage Considerations 


Types of data. TREES uses rectangular data only. 


Print options. The default output includes the splitting history and summary statistics. 
PLENGTH LONG adds a SYSTAT program for classifying new observations. You can 
cut and paste this SYSTAT program into a commandspace and submit it to classify new 
data on the same variables for cross-validation and prediction. 


Quick Graphs. TREES produces a Quick Graph for the fitted tree. The nodes may 
contain text describing split parameters or they may contain density graphs of the data 
being split. A dashed line indicates that the split is not significant. 


Saving files. TREES does not save files. Use the SYSTAT program under PLENGTH 
LONG to classify your data, compute residuals, etc., on old or new data. 


BY groups. TREES analyzes data by groups. Your file need not be sorted on the BY 
variable(s). 


Case frequencies. FREQ <variable> increases the number of cases by the FREQ variable. 


Case weights. WEIGHT is not available in TREES. 


Examples 


The following examples illustrate the features of the TREES module. The first example 
shows a classification tree for the Fisher-Anderson iris data set. The second example 
is a regression tree on an example taken from Breiman et al. (1984), and the third is a 
regression tree predicting the danger of a mammal being eaten by predators.


Example 1
Classification Tree 


This example shows a classification tree analysis of the Fisher-Anderson iris data set 
featured in Discriminant Analysis. We use the Gini loss function and display a 
graphical tree, or mobile, with dot histograms, or dit plots. 


The input is: 


USE IRIS
LAB SPECIES / 1='SETOSA', 2='VERSICOLOR', 3='VIRGINICA'
TREES
MODEL SPECIES=SEPALLEN, SEPALWID, PETALLEN, PETALWID
ESTIMATE / LOSS=GINI, DENSITY=DIT


The output is: 
Categorical Values Encountered during Processing are
Variables            Levels
SPECIES (3 levels)   setosa   versicolor   virginica

Split   Variable   PRE     Improvement
1       PETALLEN   0.500   0.500
2       PETALWID   0.890   0.390

Fitting Method                          : Gini Index
Predicted Variable                      : SPECIES
Minimum Split Index Value               : 0.050
Minimum Improvement in PRE              : 0.050
Maximum Number of Nodes Allowed         : 21
Minimum Count Allowed in Each Node      : 5
Number of Terminal Nodes in Final Tree  : 3
Proportional Reduction in Error (PRE)   : 0.890

Node   From   Count   Mode         Impurity   Split Variable   Cut Value   Fit
1      0      150                  0.333      PETALLEN         3.000       0.500
2      1      50      setosa       0.000
3      1      100                  0.250      PETALWID         1.800       0.779
4      3      54      versicolor   0.084
5      3      46      virginica    0.021


The PRE for the whole tree is 0.89 (similar to R² for a regression model), which is not
bad. Before exulting, however, we should keep in mind that while Fisher chose the iris 
data set to demonstrate his discriminant model on real data, it is barely worthy of the 
effort. We can classify the data almost perfectly by looking at a scatterplot of petal 
length against petal width. 
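
The impurity and PRE figures in the output can be checked by hand. The sketch below assumes that the Impurity column is the sum of p_i * p_j over all pairs of distinct classes (one reading of "all comparisons of possible pairs"), and that nodes 4 and 5 contain 49 versicolor with 5 virginica and 1 versicolor with 45 virginica, respectively. Both assumptions reproduce the printed values but are not stated in the output itself.

```python
from itertools import combinations

def pairwise_gini(counts):
    """Sum of p_i * p_j over all pairs of distinct classes."""
    n = sum(counts)
    return sum((a / n) * (b / n) for a, b in combinations(counts, 2))

def weighted_impurity(nodes):
    """Case-weighted mean impurity over a list of class-count lists."""
    n = sum(sum(c) for c in nodes)
    return sum(sum(c) * pairwise_gini(c) for c in nodes) / n

# Class counts as (setosa, versicolor, virginica).
root = [50, 50, 50]
node2, node3 = [50, 0, 0], [0, 50, 50]
node4, node5 = [0, 49, 5], [0, 1, 45]   # assumed composition of the last split

base = pairwise_gini(root)
pre_split1 = 1 - weighted_impurity([node2, node3]) / base
pre_tree = 1 - weighted_impurity([node2, node4, node5]) / base
print(round(base, 3), round(pre_split1, 3), round(pre_tree, 3))  # 0.333 0.5 0.89
```

Under these assumptions the impurities of nodes 4 and 5 also come out as 0.084 and 0.021, matching the node table.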

The unique SYSTAT display of the tree is called a mobile (Wilkinson, 1995). The 
dit plots are ideal for illustrating how it works. Imagine each case is a marble in a box
at each node. The mobile simply balances all of the boxes. The reason for doing this is
that we can easily see splits that cut only a few cases out of a group. These nodes will 
hang out conspicuously. It is fairly evident in the first split, for example, which cuts 
the population into half as many cases on the right (petal length less than 3) as on the 
left. 

This display has a second important characteristic that is different from other tree 
displays. The mobile coordinates the polarity of the terminal nodes (red on color 
displays) rather than the direction of the splits. This design has three consequences: we 
can evaluate the distributions of the subgroups on a common scale, we can see the 
direction of the splits on each splitting variable, and we can look at the distributions on 
the terminal nodes from left to right to see how the whole sample is split on the 
dependent variable. 

The first consequence means that every box containing data is a miniature density 
display of the subgroup’s values on a common scale (same limits and same direction). 
We don’t need to “drill down” on the data in a subgroup to see its distribution. It is 
immediately apparent in the tree. If you prefer box plots or other density
displays, simply use:


DENSITY = BOX 


or another density as an ESTIMATE option. Dit plots are most suitable for classification 
trees, however; because they spike at the category values, they look like bar charts for 
categorical data. For continuous data, dit plots look like histograms. Although they are 
my favorite density display for this purpose, they can be time consuming to draw on 
large samples, so text summary is the default graphical display. 

The second consequence of ordering the splits according to the polarity of the 
dependent (rather than the independent) variable is that the direction of the split can be 
recognized immediately by looking at which side (left or right) the split is displayed 
on. Notice that PETALLEN < 3.000 occurs on the left side of the first split. This means 
that the relation between petal length and species (coded 1..3) is positive. The same is 
true for petal width within the second split group because the split banner occurs on the 
left. Banners on the right side of a split indicate a negative relationship between the 
dependent variable and the splitting variable within the group being split, as in the 
regression tree examples. 

The third consequence of ordering the splits is that we can look at the terminal nodes 
from left to right and see the consequences of the split in order. In the present example, 
notice that the three species are ordered from left to right in the same order that they 
are coded. You can change this ordering for a categorical variable with the CATEGORY
and ORDER commands. Adding labels, as we did here, makes the output more
interpretable.


[Figure: Decision Tree for SPECIES, with splits PETALLEN < 3.00 and PETALWID < 1.80]


Example 2 


Regression Tree with Box Plots 


This example shows a simple AID model. The data set is Boston housing prices, cited
in Belsley et al. (1980) and used in Breiman et al. (1984). We are predicting median
home values (MEDV) from a set of demographic variables.


The input is: 


USE BOSTON
TREES
MODEL MEDV=CRIM..LSTAT
ESTIMATE / PMIN=.005, DENSITY=BOX


The output is:

Split   Variable   PRE     Improvement
1       RM         0.453   0.453
2       RM         0.524   0.072
3       LSTAT      0.696   0.171
4       PTRATIO    0.706   0.010
5       LSTAT      0.723   0.017
6       DIS        0.782   0.059
7       CRIM       0.809   0.027
8       NOX        0.815   0.006

Fitting Method                          : Least Squares
Predicted Variable                      : MEDV
Minimum Split Index Value               : 0.050
Minimum Improvement in PRE              : 0.005
Maximum Number of Nodes Allowed         : 21
Minimum Count Allowed in Each Node      : 5
Number of Terminal Nodes in Final Tree  : 9
Proportional Reduction in Error (PRE)   : 0.815

Node   From   Count   Mean     Standard    Split Variable   Cut Value   Fit
                               Deviation
1      0      506     22.533   9.197       RM               6.943       0.453
2      1      430     19.934   6.353       LSTAT            14.430      0.422
3      1      76      37.238   8.988       RM               7.454       0.505
4      3      46      32.113   6.497       LSTAT            11.660      0.382
5      3      30      45.097   6.156       PTRATIO          18.000      0.405
6      2      255     23.350   5.110       DIS              1.413       0.380
7      2      175     14.956   4.403       CRIM             7.023       0.337
8      5      25      46.820   3.768
9      5      5       36.480   8.841
10     4      41      33.500   4.594
11     4      5       20.740   9.080
12     6      5       45.580   9.883
13     6      250     22.905   3.866
14     7      101     17.138   3.392       NOX              0.538       0.227
15     7      74      11.978   3.857
16     14     24      20.021   3.067
17     14     77      16.239   2.975
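
The Fit and PRE values in this output can be reproduced from the counts and standard deviations in the node table. The sketch below assumes that the printed standard deviations use the n - 1 divisor, so that each node's sum of squares is (n - 1) times its variance; the helper name is illustrative.

```python
def within_ss(count, sd):
    """Sum of squares about the node mean, recovered from a sample SD
    (assumes the SD was computed with the n - 1 divisor)."""
    return (count - 1) * sd ** 2

# (count, standard deviation) pairs copied from the node table.
root = (506, 9.197)
children_of_root = [(430, 6.353), (76, 8.988)]           # nodes 2 and 3
terminal_nodes = [(25, 3.768), (5, 8.841), (41, 4.594),  # nodes 8, 9, 10
                  (5, 9.080), (5, 9.883), (250, 3.866),  # nodes 11, 12, 13
                  (74, 3.857), (24, 3.067), (77, 2.975)] # nodes 15, 16, 17

total = within_ss(*root)
fit_first_split = 1 - sum(within_ss(n, s) for n, s in children_of_root) / total
pre_tree = 1 - sum(within_ss(n, s) for n, s in terminal_nodes) / total
print(round(fit_first_split, 3), round(pre_tree, 3))  # 0.453 0.815
```

The first value matches the Fit for node 1, and the second matches the PRE for the whole tree.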


The Quick Graph of the tree more clearly reveals the sample-size feature of the mobile 
display. Notice that a number of the splits, because they separate out a few cases only, 
are extremely unbalanced. This can be interpreted in two ways, depending on context. 
On the one hand, it can mean that outliers are being separated so that subsequent splits 
can be more powerful. On the other hand, it can mean that a split is wasted by focusing 
on the outliers when further splits don’t help to improve the prediction. The former 
case appears to apply in our example. The first split separates out a few expensive 
housing tracts (the median values have a positively skewed distribution for all tracts), 
which makes subsequent splits more effective. The box plots in the terminal nodes are
narrow.


[Figure: Decision Tree for MEDV, with box plots in the nodes]


Example 3 
Regression Tree with Dit Plots 


This example involves predicting the danger of a mammal being eaten by predators
(Allison and Cicchetti, 1976). The predictors are hours of dreaming and non-dreaming
sleep, gestational age, body weight, and brain weight. Although the danger index has
only five values, we are treating it as a quantitative variable with meaningful numerical
values.


The input is:


USE SLEEP
TREES
MODEL DANGER=BODY_WT,BRAIN_WT,
             SLO_SLEEP,DREAM_SLEEP,GESTATE
ESTIMATE / DENSITY=DIT


The output is: 

18 Cases Deleted due to Missing Data.

Split   Variable      PRE     Improvement
1       DREAM_SLEEP   0.404   0.404
2       BODY_WT       0.479   0.074
3       SLO_SLEEP     0.547   0.068

Fitting Method                          : Least Squares
Predicted Variable                      : DANGER
Minimum Split Index Value               : 0.050
Minimum Improvement in PRE              : 0.050
Maximum Number of Nodes Allowed         : 21
Minimum Count Allowed in Each Node      : 5
Number of Terminal Nodes in Final Tree  : 4
Proportional Reduction in Error (PRE)   : 0.547

Node   From   Count   Mean    Standard    Split Variable   Cut Value   Fit
                              Deviation
1      0      44      2.659   1.380       DREAM_SLEEP      1.200       0.404
2      1      14      3.929   1.072       BODY_WT          4.190       0.408
3      1      30      2.067   1.081       SLO_SLEEP        12.800      0.164
4      2      6       3.167   1.169
5      2      8       4.500   0.535
6      3      23      2.304   1.105
7      3      7       1.286   0.488


[Figure: Decision Tree for DANGER, with dit plots in the nodes]


The prediction is fairly good (PRE = 0.547). The Quick Graph of this tree illustrates
another feature of mobiles. The dots in each terminal node are assigned a separate
color. This way, we can follow their path up the tree each time they are merged. If the
prediction is perfect, the top density plot will have colored dots perfectly separated.
The extent to which the colors are mixed in the top plot is a visual indication of the
badness-of-fit of the model. The fairly good separation of colors for the sleep data is
quite clear on the computer screen or with color printing but less evident in a black and
white figure.


Computation


Algorithms 


TREES uses algorithms from Breiman et al. (1984) for its splitting computations. 


Missing Data 


Missing data are eliminated from the calculation of the loss function for each split
separately.


Cluster Analysis


Leland Wilkinson, Laszlo Engelman, James Corter, and Mark Coward 
(Revised by Siva Athreya, Mousum Dutta, and Goutam Peri) 


SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric
data matrices. Cluster analysis is a multivariate procedure for detecting natural
groupings in data. It resembles discriminant analysis in one respect—the researcher 
seeks to classify a set of objects into subgroups although neither the number nor 
members of the subgroups are known. 

CLUSTER provides three procedures for clustering: Hierarchical Clustering, 
K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises 
hierarchical linkage methods. The K-Clustering procedure splits a set of objects into 
a selected number of groups by maximizing between-cluster variation and 
minimizing within-cluster variation. The Additive Trees Clustering procedure 
produces a Sattath-Tversky additive tree clustering. 

Hierarchical Clustering clusters cases, variables, or both cases and variables 
simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a 
similarity or dissimilarity matrix. Several distance metrics are available with 
Hierarchical Clustering and K-Clustering including metrics for binary, quantitative 
and frequency count data. Hierarchical Clustering has ten methods for linking clusters 
and displays the results as a tree (dendrogram) or a polar dendrogram. When the 
MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or 
color spectrum to represent the values. 

SYSTAT further provides five indices, viz., statistical criteria by which an 
appropriate number of clusters can be chosen from the Hierarchical Tree. Options for 
cutting (or pruning) and coloring the hierarchical tree are also provided. 

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and
KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting
initial seeds for both KMEANS and KMEDIANS.

Resampling procedures are available only in Hierarchical Clustering.


Statistical Background 


Cluster analysis is a multivariate procedure for detecting groupings in data. The objects 
in these groups may be: 


■ Cases (observations or rows of a rectangular data file). For example, suppose
  health indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.)
  are recorded for countries (cases), then developed nations may form a subgroup or
  cluster separate from developing countries.

■ Variables (characteristics or columns of the data). For example, suppose causes of
  death (cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded
  for each U.S. state (case); the results show that accidents are relatively independent
  of the illnesses.

■ Cases and variables (individual entries in the data matrix). For example, certain
  wines are associated with good years of production. Other wines have other years
  that are better.


Types of Clustering 


Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the 
same object to appear in more than one cluster. Exclusive clusters do not. All of the 
methods implemented in SYSTAT are exclusive. 

There are three approaches to producing exclusive clusters: hierarchical, 
partitioned, and additive trees. Hierarchical clusters consist of clusters that completely 
contain other clusters that in turn completely contain other clusters, and so on, until 
there is only one cluster. Partitioned clusters contain no other clusters. Additive trees 
use a graphical representation in which distances along branches reflect similarities 
among the objects. 

The cluster literature is diverse and contains many descriptive synonyms: 
hierarchical clustering (McQuitty, 1960; Johnson, 1967), single linkage clustering
(Sokal and Sneath, 1963), and joining (Hartigan, 1975). Output from hierarchical
methods can be represented as a tree (Hartigan, 1975) or a dendrogram (Sokal and 
Sneath, 1963). Density estimates (Hartigan 1975; Wong and Lane, 1983) can be used 
for clustering. Silverman (1986) provides several methods for density estimation.


Correlations and Distances


To produce clusters, we must be able to compute some measure of dissimilarity 
between objects. Similar objects should appear in the same cluster, and dissimilar 
objects, in different clusters. All of the methods available in CORR for producing 
matrices of association can be used in cluster analysis, but each has different 
implications for the clusters produced. Incidentally, CLUSTER converts correlations to 
dissimilarities by negating them. 

In general, the correlation measures (Pearson, Mu2, Spearman, Gamma, Tau) are 
not influenced by differences in scales between objects. For example, correlations 
between states using health statistics will not in general be affected by some states 
having larger average numbers or variation in their numbers. Use correlations when 
you want to measure the similarity in patterns across profiles regardless of overall 
magnitude. 

On the other hand, the other measures such as Euclidean and City (city-block 
distance) are significantly affected by differences in scale. For health data, two states 
will be judged to be different if they have differing overall incidences even when they 
follow a common pattern. Generally, you should use the distance measures when 
variables are measured on common scales. 
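
The contrast between the two families of measures can be seen in a small sketch. The two profiles are made up: the second follows the same pattern as the first but at a much higher overall level. The function names are illustrative and do not correspond to CORR or CLUSTER options.

```python
import math

def pearson(u, v):
    """Pearson correlation between two profiles."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))

def euclidean(u, v):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical health statistics for two states: same pattern, different level.
state_a = [10, 20, 30, 40]
state_b = [110, 120, 130, 140]

r = pearson(state_a, state_b)
print(r, -r, euclidean(state_a, state_b))  # 1.0 -1.0 200.0
```

The correlation treats the two states as identical in pattern (and its negation, used as a dissimilarity, is as small as it can be), while the Euclidean distance treats them as far apart.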


Standardizing Data 


Before you compute a dissimilarity measure, you may need to standardize your data 
across the measured attributes. Standardizing puts measurements on a common scale. 
In general, standardizing makes overall level and variation comparable across 
measurements. Consider the following data: 


OBJECT   X1   X2   X3   X4

A        10    2   11   900
B        11    3   15   895
C        13    4   12   760
D        14    1   13   874


If we are clustering the four cases (A through D), variable X4 will determine almost 
entirely the dissimilarity between cases, whether we use correlations or distances. If 
we are clustering the four variables, whichever correlation measure we use will adjust 
for the larger mean and standard deviation on X4. Thus, we should probably
standardize within columns if we are clustering rows and use a correlation measure if
we are clustering columns.

In the example below, case A will have a disproportionate influence if we are 
clustering columns. 


OBJECT    X1    X2    X3    X4

A        410   311   613   514
B          1     3     2     4
C         10    11    12    10
D         12    13    13    11


We should probably standardize within rows before clustering columns. This requires 
transposing the data before standardization. If we are clustering rows, on the other 
hand, we should use a correlation measure to adjust for the larger mean and standard 
deviation of case A. 

These are not immutable laws. The suggestions are only to make you realize that 
scales can influence distance and correlation measures. 
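
A short sketch shows the effect of standardizing within columns on the first table (cases A through D). It measures how much of the squared Euclidean distance between cases A and C is due to X4, before and after converting each column to z-scores. The helper names are illustrative, and the sample standard deviation (n - 1 divisor) is an assumed choice.

```python
import statistics

# Rows are the four cases from the first table; columns are X1..X4.
# (B's X1 value is read as 11.)
data = {
    "A": [10, 2, 11, 900],
    "B": [11, 3, 15, 895],
    "C": [13, 4, 12, 760],
    "D": [14, 1, 13, 874],
}

def standardize_columns(rows):
    """Convert each column to z-scores (mean 0, sample SD 1)."""
    columns = list(zip(*rows.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return {name: [(v - m) / s for v, m, s in zip(values, means, sds)]
            for name, values in rows.items()}

def x4_share(u, v):
    """Fraction of the squared Euclidean distance contributed by X4."""
    squared = [(a - b) ** 2 for a, b in zip(u, v)]
    return squared[3] / sum(squared)

z = standardize_columns(data)
raw_share = x4_share(data["A"], data["C"])
std_share = x4_share(z["A"], z["C"])
print(round(raw_share, 3), round(std_share, 3))
```

On the raw data, X4 accounts for nearly all of the distance between A and C; after standardizing, it contributes less than half.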


Hierarchical Clustering 


In Hierarchical Clustering, each object (case or variable) is initially considered a
separate cluster. The two ‘closest’ objects are then joined as a cluster, and the process
continues stepwise, joining an object with another object, an object
with a cluster, or a cluster with another cluster, until all objects are combined into a
single cluster. This hierarchical clustering is then displayed pictorially as a tree,
referred to as the hierarchical tree.

The meaning of ‘closest’ is defined by a rule specific to each linkage method.
Consequently, the distance matrix (or dissimilarity measure) after each merger is
computed by a different formula in each method. These formulas are
briefly explained below.


Linkage Methods 


SYSTAT provides the following linkage methods: Single, Complete, Average,
Centroid, Median, Ward’s (Ward, 1963), Weighted Average and Flexible Beta. As
explained above, each method differs in how it measures the distance between two
clusters and consequently it influences the interpretation of the word ‘closest’. Initially,
the distance matrix gives the original distance between clusters as per the input data.
The key is to compute the new distance matrix every time any two of the clusters are
merged. This is illustrated via a recurrence relationship and a table.

Suppose R, P, Q are existing clusters and P+Q is the cluster formed by merging
cluster P and cluster Q, and n_X is the number of objects in cluster X. The distance
between the two clusters R and P+Q is calculated by the following relationship:


d(R, P+Q) = w_1 d(R, P) + w_2 d(R, Q) + w_3 d(P, Q) + w_4 |d(R, P) - d(R, Q)|

where the weights w_1, w_2, w_3, w_4 are method specific, provided by the table below:


Name        w_1                        w_2                        w_3                       w_4
Single      1/2                        1/2                        0                         -1/2
Complete    1/2                        1/2                        0                         1/2
Average     n_P/(n_P+n_Q)              n_Q/(n_P+n_Q)              0                         0
Weighted    1/2                        1/2                        0                         0
Centroid    n_P/(n_P+n_Q)              n_Q/(n_P+n_Q)              -n_P n_Q/(n_P+n_Q)^2      0
Median      1/2                        1/2                        -1/4                      0
Ward        (n_R+n_P)/(n_R+n_P+n_Q)    (n_R+n_Q)/(n_R+n_P+n_Q)    -n_R/(n_R+n_P+n_Q)        0
Flexibeta   (1-β)/2                    (1-β)/2                    β                         0


From the above table it can be easily inferred that in single linkage the distance
between two clusters is the minimum of the distances between all the objects in the two
clusters, since (a + b)/2 - |a - b|/2 equals the smaller of a and b. Once the distances
between the clusters are computed, the closest two are
merged. The other methods can be suitably interpreted as well. Further descriptive
details of the methods are given in the dialog-box description section.
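
The recurrence can be sketched in a few lines. The weights for single, complete, and average linkage are the standard ones for this kind of update; the function names and the example distances are illustrative.

```python
def updated_distance(d_rp, d_rq, d_pq, w1, w2, w3, w4):
    """Distance from cluster R to the merged cluster P+Q."""
    return w1 * d_rp + w2 * d_rq + w3 * d_pq + w4 * abs(d_rp - d_rq)

SINGLE = (0.5, 0.5, 0.0, -0.5)
COMPLETE = (0.5, 0.5, 0.0, 0.5)

def average_weights(n_p, n_q):
    """Weights for average linkage, which depend on the cluster sizes."""
    return (n_p / (n_p + n_q), n_q / (n_p + n_q), 0.0, 0.0)

# Hypothetical distances: d(R,P) = 3, d(R,Q) = 7, d(P,Q) = 2.
d_rp, d_rq, d_pq = 3.0, 7.0, 2.0
single = updated_distance(d_rp, d_rq, d_pq, *SINGLE)
complete = updated_distance(d_rp, d_rq, d_pq, *COMPLETE)
average = updated_distance(d_rp, d_rq, d_pq, *average_weights(1, 3))
print(single, complete, average)  # 3.0 7.0 6.0
```

Single linkage returns the smaller of the two distances, complete linkage the larger, and average linkage a size-weighted mean (here P has one object and Q has three).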


Density Linkage Method 


SYSTAT provides two density linkage methods: the Uniform Kernel method and the
kth Nearest Neighborhood method. In these methods, a probability density estimate on
the cases is obtained. Using this and the given dissimilarity matrix, a new dissimilarity
matrix is constructed. Finally, single linkage cluster analysis is performed on the
cases using the new dissimilarity measure.

For the uniform kernel method, you provide a value for the radius r. Using this, the
density at a case x is estimated as the proportion of the cases in the sphere of radius r
centered at x. In the kth nearest neighborhood method, you provide the value of k, from
which SYSTAT estimates the density at a case x as the proportion of cases in the sphere
centered at x with radius given by the distance to the kth nearest neighbor of x. In each
of the above methods, the new dissimilarity measure between two cases is given by the
average of the reciprocals of the density values of the two cases if they both lie within
the same sphere of reference; otherwise, it is deemed to be infinite.
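
A minimal sketch of the uniform kernel version follows. It reads "within the same sphere of reference" as "no farther apart than r" and counts each case in its own sphere; both readings are assumptions, and the function is an illustration rather than SYSTAT's implementation.

```python
import math

def uniform_kernel_dissimilarity(dist, r):
    """dist is a symmetric matrix of dissimilarities between cases.
    Returns (density, new_dist) under a uniform kernel of radius r."""
    n = len(dist)
    # Density at case i: proportion of cases within distance r (itself included).
    density = [sum(1 for j in range(n) if dist[i][j] <= r) / n for i in range(n)]
    new_dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dist[i][j] <= r:
                # Average of the reciprocal densities of the two cases.
                new_dist[i][j] = 0.5 * (1 / density[i] + 1 / density[j])
            else:
                new_dist[i][j] = math.inf
    return density, new_dist

# Hypothetical cases at positions 0, 1, 2, and 10 on a line.
points = [0, 1, 2, 10]
dist = [[abs(a - b) for b in points] for a in points]
density, new_dist = uniform_kernel_dissimilarity(dist, r=1.5)
print(density)
```

Single linkage applied to new_dist would then join cases in dense regions first and leave the isolated case at 10 unattached.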

To understand the cluster displays of hierarchical clustering, it is best to look at an 
example. The following data reflect various attributes of selected performance cars. 

ACCEL   BRAKE   SLALOM   MPG    SPEED   NAMES

 5.0    245     61.3     17.0   153     Porsche 911T
 5.3    242     61.9     12.0   181     Testarossa
 5.8    243     62.6     19.0   154     Corvette
 7.0    267     57.8     14.5   145     Mercedes 560
 7.6    271     59.8     21.0   124     Saab 9000
 7.9    259     61.7     19.0   130     Toyota Supra
 8.5    263     59.9     17.5   131     BMW 635
 8.7    287     64.2     35.0   115     Civic CRX
 9.3    258     64.1     24.5   129     Acura Legend
10.8    287     60.8     25.0   100     VW Fox GL
13.0    253     62.3     27.0    95     Chevy Nova


Cluster Displays


SYSTAT displays the output of hierarchical clustering in several ways. For joining 
rows or columns, SYSTAT prints a tree. For matrix joining, it prints a shaded matrix. 


Trees. A tree is printed with a unique ordering in which every branch is lined up such 
that the most similar objects are closest to each other. If a perfect seriation 
(one-dimensional ordering) exists in the data, the tree reproduces it. The algorithm for 
ordering the tree is given in Gruvaeus and Wainer (1972). This ordering may differ 
from that of trees printed by other clustering programs if they do not use a seriation 
algorithm to determine how to order branches. The advantage of using seriation is most 
apparent for single linkage clustering. 

If you join rows, the end branches of the tree are labeled with case numbers or
labels. If you join columns, the end branches of the tree are labeled with variable
names.


Direct display of a matrix. As an alternative to trees, SYSTAT can produce a shaded
display of the original data matrix in which rows and columns are permuted according
to an algorithm in Gruvaeus and Wainer (1972). Different characters represent the
magnitude of each number in the matrix (see Ling, 1973). A legend showing the range
of data values that these characters represent appears with the display.

Cutpoints between these values and their associated characters are selected to 
heighten contrast in the display. The method for increasing contrast is derived from 
techniques used in computer pattern recognition, in which gray-scale histograms for 
visual displays are modified to heighten contrast and enhance pattern detection. To 
find these cutpoints, we sort the data and look for the largest gaps between adjacent 
values. Tukey’s gapping method (see Wainer and Schacht, 1978) is used to determine
how many gaps (and associated characters) should be chosen to heighten contrast for 
a given set of data. This procedure, time consuming for large matrices, is described in 
detail in Wilkinson (1979). 

If you have a course to grade and are looking for a way to find rational cutpoints in
the grade distribution, you might want to use this display to choose the cutpoints.
Cluster the n x 1 matrix of numeric grades (n students by 1 grade) and let SYSTAT
choose the cutpoints. Only cutpoints asymptotically significant at the 0.05 level are
chosen. If no cutpoints are chosen in the display, give everyone an A, flunk them all,
or hand out numeric grades (unless you teach at Brown University or Hampshire
College).
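
The idea of sorting the data and looking for the largest gaps can be sketched in Python as follows. This is a deliberately simplified illustration: it does not implement Tukey's gapping method or any significance test, the function name is hypothetical, and placing each cutpoint at the midpoint of a gap is an assumption.

```python
def largest_gap_cutpoints(values, n_cuts):
    """Return n_cuts cutpoints placed in the widest gaps between sorted values."""
    s = sorted(values)
    # Pair each gap width with the midpoint of the gap.
    gaps = [(s[i + 1] - s[i], (s[i] + s[i + 1]) / 2) for i in range(len(s) - 1)]
    # Keep the widest gaps and report their midpoints in increasing order.
    widest = sorted(gaps, key=lambda g: g[0], reverse=True)[:n_cuts]
    return sorted(mid for _, mid in widest)

grades = [55, 58, 60, 72, 74, 75, 90, 93]
print(largest_gap_cutpoints(grades, 2))  # [66.0, 82.5]
```

Unlike the procedure described above, this sketch always returns the requested number of cutpoints, whether or not the gaps are statistically meaningful.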


Clustering Rows


First, let us look at possible clusters of the cars in the example. Since the variables are 
on such different scales, we will standardize them before doing the clustering. This will 
give acceleration comparable influence to braking. Then we select Pearson 
correlations as the basis for dissimilarity between cars. The result is: 


Cluster Tree 


If you look at the correlation matrix for the cars, you will see how these clusters hang 
together. Cars within the same cluster (for example, Corvette, Testarossa, Porsche) 
generally correlate highly. 

          Porsche   Testa    Corv    Merc    Saab
Porsche      1.00
Testa        0.94    1.00
Corv         0.94    0.87    1.00
Merc         0.09    0.21   -0.24    1.00
Saab        -0.51   -0.52   -0.76    0.66    1.00
Toyota       0.24    0.43    0.40   -0.38    0.53
BMW         -0.32   -0.10   -0.56    0.85
Civic       -0.50   -0.73   -0.39   -0.52
Acura       -0.05   -0.10    0.30   -0.98
VW          -0.96   -0.93   -0.98    0.08
Chevy       -0.73   -0.70   -0.49   -0.53

           Toyota     BMW   Civic   Acura
Toyota       1.00
BMW         -0.25    1.00
Civic       -0.30   -0.50    1.00
Acura        0.53   -0.79    0.35    1.00
VW          -0.35    0.39    0.55   -0.16
Chevy       -0.03   -0.06    0.32    0.54


Clustering Columns


We can cluster the performance attributes of the cars more easily. Here, we do not need
to standardize within cars (by rows) because all of the values are comparable between
cars. Again, to give each variable comparable influence, we will use Pearson
correlations as the basis for the dissimilarities. The result based on the data
standardized by variable (column) is:


Cluster Tree

[Figure: Cluster Tree; horizontal axis: Distances, 0.0 to 1.2]


Clustering Rows and Columns 


To cluster the rows and columns jointly, we should first standardize the variables to 
give each of them comparable influence on the clustering of cars. Once we have 
standardized the variables, we can use Euclidean distances because the scales are 
comparable. 


Single linkage is used to produce the following result: 


Permuted Data

[Figure: Permuted Data (shaded matrix display); horizontal axis: Index of Variable]


This figure displays the standardized data matrix itself with rows and columns 
permuted to reveal clustering and each data value replaced by one of three symbols. 
Note that the rows are ordered according to overall performance, with the fastest cars 
at the top. 

Matrix clustering is especially useful for displaying large correlation matrices. You 
may want to cluster the correlation matrix this way and then use the ordering to 
produce a scatterplot matrix that is organized by the multivariate structure. 


Cluster Validity Indices 


The fundamental aim of the cluster validity indices is to enable the user to choose an 
optimal number of clusters in the data subject to pre-defined conditions. Milligan and 
Cooper (1985) studied several such indices. In this section we discuss five indices that 
are provided by SYSTAT for Hierarchical clustering. 


Root Mean Square Standard Deviation (RMSSTD) Index. This index is the root mean
square standard deviation of all the variables within each cluster. It is calculated by
computing the within-group sum of squares of each cluster and normalizing it by the
product of the number of variables and one less than the number of elements in the
cluster (Sharma, 1995). More precisely,


$$\mathrm{RMSSTD} = \sqrt{\frac{W_k}{v\,(N_k - 1)}}$$


where $W_k$ is the within-group sum of squares of cluster k, $N_k$ is the number of
elements in cluster k, and v is the number of variables. SYSTAT calculates the index at
each step of the hierarchical algorithm, providing a measure of homogeneity of the
clusters that have been formed. Thus, the smaller the value of the RMSSTD, the better
is the cluster formed. At any hierarchical step, if the RMSSTD value rises, then the new
clustering scheme is worse.

SYSTAT provides a plot of RMSSTD for a number of steps in the hierarchical
clustering. You can then determine the number of clusters that exist in a data set by
spotting the ‘knee’ (in other words, the steep jump of the index value from higher to
smaller numbers of clusters) in the graph. This index is valid for rectangular data. If a
dissimilarity matrix is available, then the index is valid only if the methods used are
average, centroid or Ward.
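
As a worked illustration of the RMSSTD definition, the following Python sketch computes the index for a single cluster of rectangular data. The function name is hypothetical, and the sketch assumes the cluster holds at least two complete cases.

```python
import math

def rmsstd(cluster):
    """cluster: list of cases, each a list of v numeric values (at least 2 cases)."""
    n_k = len(cluster)
    v = len(cluster[0])
    # Mean of each variable within the cluster.
    means = [sum(case[j] for case in cluster) / n_k for j in range(v)]
    # Within-group sum of squares, pooled over all variables.
    w_k = sum((case[j] - means[j]) ** 2 for case in cluster for j in range(v))
    return math.sqrt(w_k / (v * (n_k - 1)))

# Two cases, two variables: W_k = 10, v = 2, N_k - 1 = 1, so RMSSTD = sqrt(5).
print(rmsstd([[1, 2], [3, 6]]))
```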


Dunn’s Index. This cluster validity index was proposed by Dunn (1973). Suppose the
number of clusters at a given level in the hierarchical cluster tree is k. For any two
clusters $X_i$ and $X_j$, let $\delta(X_i, X_j)$ be the distance between the two clusters and
$\Delta(X_i)$ be the diameter of cluster $X_i$. Dunn’s index is defined as the minimum of the
ratio of the dissimilarity measure between two clusters to the largest cluster diameter,
where the minimum is taken over all pairs of clusters in the data set. More precisely,


$$\text{Dunn's Index} = \min_{1 \le i \le k} \; \min_{\substack{1 \le j \le k \\ j \ne i}} \; \frac{\delta(X_i, X_j)}{\max_{1 \le l \le k} \Delta(X_l)}$$


Originally, the distance between two sets is defined as the minimum distance between
two points taken from different sets, whereas the diameter of a set is defined as the
maximum distance between two points in the set. A generalization of the above
measurement can be found in Bezdek and Pal (1998). If the data set contains close-knit
but separated clusters, the distance between the clusters is expected to be large and the
diameter of the clusters is expected to be small. So, based on the definition, large
values of the index indicate the presence of compact and well-separated clusters. Thus,
the clustering that attains the maximum in the plot of Dunn’s index versus the number of
clusters is the appropriate one. This index is valid for both rectangular and
dissimilarity data.
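
A minimal Python sketch of Dunn's index, using the original set distance (minimum distance between points of different clusters) and diameter (maximum distance between points of one cluster), might look as follows. The function name and the input layout (a dissimilarity matrix plus lists of case indices) are assumptions; at least two clusters are required, and at least one cluster must contain two distinct cases.

```python
def dunn_index(d, clusters):
    """d: dissimilarity matrix; clusters: list of lists of case indices."""
    # Largest diameter: maximum dissimilarity between two members of one cluster.
    max_diameter = max(d[a][b] for c in clusters for a in c for b in c)
    # Smallest separation: minimum dissimilarity between members of different clusters.
    min_separation = min(
        d[a][b]
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
        for a in ci
        for b in cj
    )
    return min_separation / max_diameter

points = [0, 1, 5, 6]
d = [[abs(p - q) for q in points] for p in points]
print(dunn_index(d, [[0, 1], [2, 3]]))  # 4.0
```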


Davies-Bouldin’s (DB’s) Index. Let k be the number of clusters at a given step in
hierarchical clustering. Let $v_{X_i}$ denote the centre of the cluster $X_i$ and $|X_i|$ the size of
the cluster $X_i$. Define

$$S_i = \left( \frac{1}{|X_i|} \sum_{x \in X_i} d^2(x, v_{X_i}) \right)^{1/2}$$

as the measure of dispersion of cluster $X_i$,

$$d_{ij} = d(v_{X_i}, v_{X_j})$$

as the dissimilarity measure between clusters $X_i$ and $X_j$, and

$$R_i = \max_{j,\, j \ne i} \frac{S_i + S_j}{d_{ij}}$$

Then the DB (Davies and Bouldin, 1979) Index is defined as

$$\text{DB's Index} = \frac{1}{k} \sum_{i=1}^{k} R_i$$

It is clear that the DB index quantifies the average similarity between a cluster and its
most similar counterpart. It is desirable for the clusters to be as distinct from each other
as possible. So a clustering which minimizes the DB index is the ideal one. This index
can be calculated for rectangular data.
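
The Davies-Bouldin definition can be traced step by step in the Python sketch below. It assumes Euclidean distance for d and takes the cluster centre to be the mean vector; the function and helper names are hypothetical.

```python
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of cases (lists of numbers)."""
    def centre(c):
        return [sum(col) / len(c) for col in zip(*c)]

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    centres = [centre(c) for c in clusters]
    # S_i: root mean squared distance of the members of cluster i to its centre.
    s = [math.sqrt(sum(dist(x, v) ** 2 for x in c) / len(c))
         for c, v in zip(clusters, centres)]
    k = len(clusters)
    # R_i: similarity of cluster i to its most similar counterpart.
    r = [max((s[i] + s[j]) / dist(centres[i], centres[j])
             for j in range(k) if j != i)
         for i in range(k)]
    return sum(r) / k

# Centres 1 and 11, S = 1 for both clusters, so R = 2/10 and the index is 0.2.
print(davies_bouldin([[[0], [2]], [[10], [12]]]))
```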


Pseudo F Index. The pseudo F statistic describes the ratio of between-cluster variance
to within-cluster variance (Calinski and Harabasz, 1974):


$$\text{Pseudo } F = \frac{\mathrm{GSS}/(K - 1)}{\mathrm{WSS}/(N - K)}$$


where N is the number of observations, K is the number of clusters at any step in the
hierarchical clustering, GSS is the between-group sum of squares, and WSS is the
within-group sum of squares. Large values of Pseudo F indicate close-knit and
separated clusters. In particular, peaks in the pseudo F statistic are indicators of greater
cluster separation. Typically, these are spotted in the plot of the index versus the
number of clusters. This index is valid for rectangular data and for any hierarchical
clustering procedure. In the case of dissimilarity data, one can use this index for
hierarchical clustering if the methods used are average, centroid or Ward.
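
The pseudo F ratio can be computed directly from its definition, as in this Python sketch for rectangular data. The function name is hypothetical, and the sketch assumes at least two clusters and a nonzero within-group sum of squares.

```python
def pseudo_f(clusters):
    """clusters: list of clusters, each a list of cases (lists of numbers)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    v = len(clusters[0][0])
    grand = [sum(case[j] for c in clusters for case in c) / n for j in range(v)]
    wss = 0.0  # within-group sum of squares
    gss = 0.0  # between-group sum of squares
    for c in clusters:
        mean = [sum(case[j] for case in c) / len(c) for j in range(v)]
        wss += sum((case[j] - mean[j]) ** 2 for case in c for j in range(v))
        gss += len(c) * sum((mean[j] - grand[j]) ** 2 for j in range(v))
    return (gss / (k - 1)) / (wss / (n - k))

# GSS = 100, WSS = 4, N = 4, K = 2, so pseudo F = (100/1) / (4/2) = 50.
print(pseudo_f([[[0], [2]], [[10], [12]]]))
```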


Pseudo T-square Index. Suppose, during a step in the hierarchical clustering, cluster K
and cluster L are merged to form a new cluster. Then, the pseudo T-square statistic for
the clustering obtained is given by


$$\text{Pseudo T-square} = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)}$$


where $N_K$ and $N_L$ are the numbers of observations in clusters K and L, $W_K$ and $W_L$
are the within-cluster sums of squares of clusters K and L, and $B_{KL}$ is the between-cluster
sum of squares. This index quantifies the difference between two clusters that are
merged at a given step. Thus, if the pseudo T-square statistic has a distinct jump at step
k of the hierarchical clustering, then the clustering in step k+1 is selected as the optimal
cluster. The pseudo T-square statistic is closely related to Duda and Hart's
(Je(2)/Je(1)) index.
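
The following Python sketch evaluates the pseudo T-square statistic for one merge. The function names are hypothetical, and the between-cluster sum of squares is computed as the within sum of squares of the merged cluster minus the within sums of squares of the two parts, which is an assumption about how $B_{KL}$ is obtained.

```python
def within_ss(cluster):
    """Within-cluster sum of squares of a list of cases (lists of numbers)."""
    v = len(cluster[0])
    mean = [sum(case[j] for case in cluster) / len(cluster) for j in range(v)]
    return sum((case[j] - mean[j]) ** 2 for case in cluster for j in range(v))

def pseudo_t_square(cluster_k, cluster_l):
    w_k, w_l = within_ss(cluster_k), within_ss(cluster_l)
    # Assumed: B_KL is the increase in within sum of squares caused by the merge.
    b_kl = within_ss(cluster_k + cluster_l) - w_k - w_l
    return b_kl / ((w_k + w_l) / (len(cluster_k) + len(cluster_l) - 2))

# W_K = W_L = 2, merged W = 104, so B_KL = 100 and the statistic is 100 / (4/2) = 50.
print(pseudo_t_square([[0], [2]], [[10], [12]]))
```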


Partitioning via K-Clustering 


To produce partitioned clusters, you must decide in advance how many clusters you 
want. K-Clustering searches for the best way to divide your objects into K different 
sections so that they are separated as well as possible. K-Clustering provides two such 
procedures: K-Means and K-Medians. 


K-Means 


K-Means, which is the default procedure, begins by picking ‘seed’ cases, one for each
cluster, which are spread apart as much as possible from the centre of all the cases. 
Then it assigns all cases to the nearest seed. Next, it attempts to reassign each case to 
a different cluster in order to reduce the within-groups sum of squares. This continues 
until the within-groups sum of squares can no longer be reduced. The initial seeds can 
be chosen from nine possible options. 
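
The assign-and-reassign cycle can be sketched in Python as below. This is a batch simplification and not SYSTAT's algorithm: seeds are supplied by the caller, all cases are reassigned together in each pass rather than one at a time, and the function name is hypothetical. The default limit of 20 passes mirrors the default maximum number of iterations.

```python
def k_means(cases, seeds, max_iter=20):
    """cases, seeds: lists of numeric lists. Returns (assignment, centres)."""
    centres = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every case to the nearest centre (squared Euclidean distance).
        new_assignment = [
            min(range(len(centres)),
                key=lambda g: sum((x - c) ** 2 for x, c in zip(case, centres[g])))
            for case in cases
        ]
        if new_assignment == assignment:
            break  # no case changed cluster, so the sum of squares cannot drop further
        assignment = new_assignment
        # Recompute each centre as the mean of its current members.
        for g in range(len(centres)):
            members = [case for case, a in zip(cases, assignment) if a == g]
            if members:
                centres[g] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres

print(k_means([[0], [1], [10], [11]], seeds=[[0], [11]]))
# ([0, 0, 1, 1], [[0.5], [10.5]])
```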

K-Means does not search through every possible partitioning of the data, so it is
possible that some other solution might have a smaller within-groups sum of squares.
Nevertheless, it has performed relatively well on globular data separated in several
dimensions in Monte Carlo studies of cluster algorithms.

Because it focuses on reducing the within-groups sum of squares, K-Means 
clustering is like a multivariate analysis of variance in which the groups are not known 
in advance. The output includes analysis of variance statistics, although you should be 
cautious in interpreting them. Remember, the program is looking for large F-ratios in 
the first place, so you should not be too impressed by large values. 

The following is a three-group analysis of the car data. The clusters are similar to 
those we found by joining. K-Means clustering uses Euclidean distances instead of 
Pearson correlations, so there are minor differences because of scaling. 
To keep the influences of all variables comparable, we standardized the data before
running the analysis.


Distance Metric is Euclidean Distance
K-Means splitting cases into 3 groups


Summary Statistics for All Cases

Variable      Between SS   df   Within SS   df   F-ratio
ACCEL              7.825    2       2.175    8    14.389
BRAKE              5.657    2       4.343    8     5.211
SLALOM             5.427    2       4.573    8     4.747
MPG                7.148    2       2.852    8    10.027
SPEED              7.677    2       2.323    8    13.220
** TOTAL **       33.735   10      16.265   40


Cluster 1 of 3 Contains 4 Cases

Members                     Statistics
                                                                  Standard
Case            Distance    Variable   Minimum     Mean   Maximum Deviation
Mercedes 560       0.596    ACCEL       -0.451   -0.138     0.174     0.260
Saab 9000          0.309    BRAKE       -0.149    0.230     0.608     0.326
Toyota Supra       0.488    SLALOM      -1.952   -0.894     0.111     0.843
BMW 635            0.159    MPG         -1.010   -0.470    -0.007     0.423
                            SPEED       -0.338    0.002     0.502     0.355


Cluster 2 of 3 Contains 4 Cases

Members                     Statistics
                                                                  Standard
Case                        Variable   Minimum     Mean   Maximum Deviation
Civic CRX                   ACCEL        0.258    0.988     2.051     0.799
Acura Legend                BRAKE       -0.528    0.624     1.619     1.155
VW Fox GL                   SLALOM      -0.365    0.719     1.432     0.857
Chevy Nova                  MPG          0.533    1.054     2.154     0.752
                            SPEED       -1.498   -0.908    -0.138     0.616


Cluster 3 of 3 Contains 3 Cases

Members                     Statistics
                                                                  Standard
Case            Distance    Variable   Minimum     Mean   Maximum Deviation
Porsche 911T       0.253    ACCEL       -1.285   -1.132    -0.952     0.169
Testarossa         0.431    BRAKE       -1.223   -1.138    -1.033     0.096
Corvette           0.314    SLALOM      -0.101    0.234     0.586     0.344
                            MPG         -1.396   -0.779    -0.316     0.557
                            SPEED        0.822    1.208     1.941     0.635


K-Medians 


The second approach available in K-Clustering is K-Medians. The K-Medians
procedure follows the same amalgamation approach as K-Means except for a key
difference. It uses the median to reassign each case to a different cluster in order to
reduce the within-groups sum of absolute deviations.


Sattath and Tversky (1977) developed additive trees for modeling similarity/
dissimilarity data. Hierarchical clustering methods require objects in the same cluster
to have identical distances to each other. Moreover, these distances must be smaller
than the distances between clusters. These restrictions prove problematic for similarity
data, and, as a result, hierarchical clustering cannot fit such data well.

In contrast, additive trees use the tree branch length to represent distances between 
objects. Allowing the within-cluster distances to vary yields a tree diagram with 
varying branch lengths. Objects within a cluster can be compared by focusing on the 
horizontal distance along the branches connecting them. The additive tree for the car 
data is as follows: 


Additive Tree 

The distances between nodes of the graph are:


Node   Length   Child
   1     0.10   Porsche
   2     0.49   Testa
   3     0.14   Corv
   4     0.52   Merc
   5     0.19   Saab
   6     0.13   Toyota
   7     0.11   BMW
   8     0.71   Civic
   9     0.30   Acura
  10     0.42   VW
  11     0.62   Chevy
  12     0.06   1,2
  13     0.08   8,10
  14     0.49   12,3
  15     0.18   13,11
  16     0.35   9,15
  17     0.04   14,6
  18     0.13   17,16
  19     0.0    5,18
  20     0.04   4,7
  21     0.0    20,19


Each object is a node in the graph. In this example, the first 11 nodes represent the cars.
Other graph nodes correspond to “groupings” of the objects. Here, the 12th node
represents Porsche and Testa.

The distance between any two nodes is the sum of the (horizontal) lengths between
them. The distance between Chevy and VW is 0.62 + 0.08 + 0.42 = 1.12. The
distance between Chevy and Civic is 0.62 + 0.08 + 0.71 = 1.41. Consequently,
Chevy is more similar to VW than to Civic.
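
The path sums above can be reproduced with a short Python sketch built from the node table. It assumes that each node's length is the length of the branch joining it to its parent, which is consistent with the two worked distances; the dictionary and function names are hypothetical.

```python
# Branch length of each node and the children of each grouping node, from the table.
LENGTH = {1: 0.10, 2: 0.49, 3: 0.14, 4: 0.52, 5: 0.19, 6: 0.13, 7: 0.11,
          8: 0.71, 9: 0.30, 10: 0.42, 11: 0.62, 12: 0.06, 13: 0.08, 14: 0.49,
          15: 0.18, 16: 0.35, 17: 0.04, 18: 0.13, 19: 0.0, 20: 0.04, 21: 0.0}
CHILDREN = {12: (1, 2), 13: (8, 10), 14: (12, 3), 15: (13, 11), 16: (9, 15),
            17: (14, 6), 18: (17, 16), 19: (5, 18), 20: (4, 7), 21: (20, 19)}
PARENT = {child: node for node, kids in CHILDREN.items() for child in kids}

def path_to_root(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def tree_distance(a, b):
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = set(path_a) & set(path_b)
    # Add the branch lengths of every node below the first common ancestor.
    return (sum(LENGTH[n] for n in path_a if n not in common)
            + sum(LENGTH[n] for n in path_b if n not in common))

print(round(tree_distance(11, 10), 2))  # Chevy to VW: 1.12
print(round(tree_distance(11, 8), 2))   # Chevy to Civic: 1.41
```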


Cluster Analysis in SYSTAT


Hierarchical Clustering Dialog Box 


Hierarchical clustering produces hierarchical clusters that are displayed in a tree. 
Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins 
by joining the two “closest” objects as a cluster and continues (in a stepwise manner) 
joining an object with another object, an object with a cluster, or a cluster with another 
cluster until all objects are combined into one cluster. 


To open the Hierarchical Clustering dialog box, from the menus choose: 


Analyze 
Cluster Analysis 
Hierarchical... 


[Figure: Hierarchical Clustering dialog box (tabs: Main, Options, Mahalanobis, Resampling), showing the Available variable(s) and Selected variable(s) lists, the join options Rows, Columns, and Matrix, and the Polar, Save, Linkage, Distance, and Number of clusters settings]


You must select the elements of the data file to cluster (Join):

■ Rows. Rows (cases) of the data matrix are clustered.

■ Columns. Columns (variables) of the data matrix are clustered.

■ Matrix. In Matrix, rows and columns of the data matrix are clustered; they are
  permuted to bring similar rows and columns next to one another.


Linkage allows you to specify the type of joining algorithm used to amalgamate
clusters (that is, define how distances between clusters are measured).

■ Average. Average linkage averages all distances between pairs of objects in
  different clusters to decide how far apart they are.

■ Centroid. Centroid linkage uses the average value of all objects in a cluster (the
  cluster centroid) as the reference point for distances to other objects or clusters.

■ Complete. Complete linkage uses the most distant pair of objects in two clusters to
  compute between-cluster distances. This method tends to produce compact,
  globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT
  file, you get Johnson's “max” method.

■ Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of
  objects in different clusters to decide how far apart they are. You can choose the
  value of the weight β. The range of β is between -1 and 1.

■ K-nbd. The kth nearest neighborhood method is a density linkage method. The
  estimated density is proportional to the number of cases in the smallest sphere
  containing the kth nearest neighbor. A new dissimilarity matrix is then constructed
  using the density estimate. Finally, single linkage cluster analysis is performed.
  You can specify the number k; its range is between 1 and the total number of cases
  in the data set.

■ Median. Median linkage uses the median distances between pairs of objects in
  different clusters to decide how far apart they are.

■ Single. Single linkage defines the distance between two objects or clusters as the
  distance between the two closest members of those clusters. This method tends to
  produce long, stringy clusters. If you use a SYSTAT file that contains a similarity
  or dissimilarity matrix, you get clustering via Johnson’s “min” method.

■ Uniform. The Uniform Kernel method is a density linkage method. The estimated
  density is proportional to the number of cases in a sphere of radius r. A new
  dissimilarity matrix is then constructed using the density estimate. Finally, single
  linkage cluster analysis is performed. You can choose the number r; its range is the
  positive real line.

■ Ward. Ward’s method averages all distances between pairs of objects in different
  clusters, with adjustments for covariances, to decide how far apart the clusters are.

■ Weighted. Weighted average linkage uses a weighted average distance between
  pairs of objects in different clusters to decide how far apart they are. The weights
  used are proportional to the size of the cluster.
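
Three of the linkage rules above (Single, Complete, and Average) differ only in how they summarize the pairwise distances between two clusters, as this Python sketch shows. The function name and input layout are hypothetical, and the other linkage methods are not covered.

```python
def cluster_distance(d, a, b, linkage):
    """d: dissimilarity matrix; a, b: lists of case indices in two clusters."""
    pairs = [d[i][j] for i in a for j in b]
    if linkage == "single":      # two closest members
        return min(pairs)
    if linkage == "complete":    # most distant pair
        return max(pairs)
    if linkage == "average":     # mean of all between-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError("unsupported linkage: " + linkage)

points = [0, 1, 5, 6]
d = [[abs(p - q) for q in points] for p in points]
for method in ("single", "complete", "average"):
    print(method, cluster_distance(d, [0, 1], [2, 3], method))
# single 4, complete 6, average 5.0
```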


For some data, some methods cannot produce a hierarchical tree with strictly 
increasing amalgamation distances. In these cases, you may see stray branches that do 
not connect to others. If this happens, you should consider Single or Complete linkage. 
For more information on these problems, see Fisher and Van Ness (1971). These 
reviewers concluded that these and other problems made Centroid, Average, Median, 
and Ward (as well as K-Means) “inadmissible” clustering procedures. In practice and 
in Monte Carlo simulations, however, they sometimes perform better than Single and 
Complete linkage, which Fisher and Van Ness considered “admissible.” Milligan 
(1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation 
of clustering algorithms. Consult his paper for further details. 


In addition, the following options can be specified: 
Distance. Specifies the distance metric used to compare clusters. 
Polar. Produces a polar (circular) cluster tree. 


Save. Save provides two options either to save cluster identifiers or to save cluster 
identifiers along with data. You can specify the number of clusters to identify for the 
saved file. If not specified, two clusters are identified. 


Clustering Distances 


Both Hierarchical Clustering and K-Clustering allow you to select the type of distance
metric to use between objects. From the Distance drop-down list, you can select:

■ Absolute. Distances are computed using absolute differences. Use this metric for
  quantitative variables. The computation excludes missing values.

■ Anderberg. Distances are computed using a dissimilarity form of Anderberg’s
  similarity coefficients for binary data. Anderberg distance is available for
  hierarchical clustering only.

■ Chi-square. Distances are computed as the chi-square measure of independence of
  rows and columns on 2-by-n frequency tables, formed by pairs of cases (or
  variables). Use this metric when the data are counts of objects or events.

■ Euclidean. Clustering is computed using normalized Euclidean distance (root
  mean squared distances). Use this metric with quantitative variables. Missing
  values are excluded from computations.

■ Gamma. Distances are computed using one minus the Goodman-Kruskal gamma
  correlation coefficient. Use this metric with rank order or ordinal scales. Missing
  values are excluded from computations.

■ Jaccard. Clustering is computed using the dissimilarity form of Jaccard’s
  similarity coefficient for binary data. Jaccard distance is only available for
  hierarchical clustering.

■ Mahalanobis. Distances are computed using the square root of the quadratic form
  of the deviations among two random vectors using the inverse of their
  variance-covariance matrix. This metric can also be used to cluster groups. Use
  this metric with quantitative variables. Missing values are excluded from
  computations.

■ Minkowski. Clustering is computed using the pth root of the mean pth powered
  distances of coordinates. Use this metric for quantitative variables. Missing values
  are excluded from computations. Use the Power text box to specify the value of p.

■ MW (available for K-Clustering only). Distances are computed as the increment in
  within sum of squares of deviations, if the case would belong to a cluster. The case
  is moved into the cluster that minimizes the within sum of squares of deviations.
  Use this metric with quantitative variables. Missing values are excluded from
  computations.

■ Pearson. Distances are computed using one minus the Pearson product-moment
  correlation coefficient for each pair of objects. Use this metric for quantitative
  variables. Missing values are excluded from computations.

■ Percent (available for hierarchical clustering only). Clustering uses a distance
  metric that is the percentage of comparisons of values resulting in disagreements
  in two profiles. Use this metric with categorical or nominal scales.

■ Phi-square. Distances are computed as the phi-square (chi-square/total) measure
  on 2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric
  when the data are counts of objects or events.

■ Rsquared. Distances are computed using one minus the square of the Pearson
  product-moment correlation coefficient for each pair of objects. Use this metric
  with quantitative variables. Missing values are excluded from computations.

■ RT. Clustering uses the dissimilarity form of Rogers and Tanimoto’s similarity
  coefficient for categorical data. RT distance is available only for hierarchical
  clustering.

■ Russel. Clustering uses the dissimilarity form of Russel’s similarity coefficient for
  binary data. Russel distance is available only for hierarchical clustering.

■ SS. Clustering uses the dissimilarity form of Sneath and Sokal’s similarity
  coefficient for categorical data. SS distance is available only for hierarchical
  clustering.
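
Three of the metrics listed above (Euclidean, Minkowski, and Pearson) can be written out from their verbal definitions, as in the Python sketch below. The function names are hypothetical, and the sketch assumes complete profiles; the exclusion of missing values is not handled.

```python
import math

def euclidean(x, y):
    """Normalized Euclidean distance: root of the mean squared difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def minkowski(x, y, p):
    """pth root of the mean pth-powered coordinate distance."""
    return (sum(abs(a - b) ** p for a, b in zip(x, y)) / len(x)) ** (1.0 / p)

def pearson(x, y):
    """One minus the Pearson product-moment correlation of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

print(euclidean([0, 0], [3, 4]))        # sqrt(25 / 2)
print(minkowski([0, 0], [3, 4], 1))     # mean absolute difference: 3.5
print(pearson([1, 2, 3], [3, 2, 1]))    # perfectly opposed profiles give 2
```

With p = 2, the Minkowski form reduces to the normalized Euclidean distance.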


Mahalanobis 


In the Mahalanobis tab, you can specify the covariance matrix to compute Mahalanobis 
distance. 


[Figure: Mahalanobis tab of the Hierarchical Clustering dialog box, with the Covariance matrix options From data (with Grouping variable), From keyboard, and From file]

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis 
distance. Enter the covariance matrix either through the keyboard or from a SYSTAT 
file. Otherwise, by default SYSTAT computes the matrix from the data. Select a 
grouping variable for inter-group distance measures. 


Options


[Figure: Options tab of the Hierarchical Clustering dialog box, with the groups Cut cluster tree at (Height, Leaf nodes), Color clusters by (Length of terminal node, Proportion of total nodes), and Validity]


The following options are available:

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

■ Height. Provides the option of cutting the cluster tree at a specified distance.

■ Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.


Color clusters by. The colors in the cluster tree can be assigned by two different
methods:

■ Length of terminal node. As you pass from node to node in order down the cluster
  tree, the color changes when the length of a node on the distance scale changes
  between less than and greater than the specified length of terminal nodes (on a scale
  of 0 to 1).

■ Proportion of total nodes. Colors are assigned based on the proportion of members
  in a cluster.


Validity. Provides five validity indices to evaluate the partition quality. In particular, it
is used to find out the appropriate number of clusters for the given data set.

■ RMSSTD. Provides root-mean-square standard deviation of the clusters at each
  step in hierarchical clustering.

■ Pseudo F. Provides pseudo F-ratio for the clusters at each step in hierarchical
  clustering.

■ Pseudo T-square. Provides pseudo T-square statistic for cluster assessment.

■ DB. Provides Davies-Bouldin’s index for each hierarchy of clustering. This index
  is applicable for rectangular data only.

■ Dunn. Provides Dunn's cluster separation measure.

■ Maximum groups. Performs the computation of indices up to this specified number
  of clusters. The default value is the square root of the number of objects.


K-Clustering Dialog Box 


The K-Clustering dialog box provides options for K-Means clustering and K-Medians
clustering. Both clustering methods split a set of objects into a selected number of
groups by maximizing between-cluster variation relative to within-cluster variation. It
is similar to doing a one-way analysis of variance where the groups are unknown and
the largest F value is sought by reassigning members to each group.

By default, the algorithm starts with one cluster and splits it into two clusters by
picking the case farthest from the center as a seed for a second cluster and assigning
each case to the nearest center. It continues splitting one of the clusters into two (and
reassigning cases) until a specified number of clusters are formed. The reassigning of
cases continues until the within-groups sum of squares can no longer be reduced. The
initial seeds or partitions can be chosen from a possible set of nine options.


To open the K-Clustering dialog box, from the menus choose: 


Analyze 
Cluster Analysis 
K-Clustering... 


Algorithm. Provides K-Means and K-Medians clustering options.

■ K-means. Requests K-Means clustering.

■ K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options to save either cluster identifiers, cluster identifiers
along with data, or final cluster seeds, to a SYSTAT file.


Mahalanobis


See the Mahalanobis tab in Hierarchical clustering.


Initial Seeds


To specify the initial seeds for clustering, click on the Initial Seeds tab. 


[Figure: Initial Seeds tab of the K-Clustering dialog box, listing the options None, First k, Last k, Random k, Random segmentation, Principal component, Hierarchical segmentation (with Linkage), Partition variable, and From file, together with the Random seed field]


The following initial seeds options are available:

■ None. Starts with one cluster and splits it into two clusters by picking the case
  farthest from the center as a seed for the second cluster and then assigning each
  case optimally.

■ First K. Considers the first K non-missing cases as initial seeds.

■ Last K. Considers the last K non-missing cases as initial seeds.

■ Random K. Chooses randomly (without replacement) K non-missing cases as
  initial seeds.

■ Random segmentation. Assigns each case to any of K partitions randomly.
  Computes seeds from each initial partition taking the mean or the median of the
  observations, whichever is applicable.

■ Principal component. Uses the first principal component as a single variable. Sorts
  all cases based on this single variable. It creates partitions taking the first n/K cases
  in the first partition, the next n/K cases in the second partition, and so on.

■ Hierarchical segmentation. Makes the initial K partitions from hierarchical
  clustering with the specified linkage method.

■ Partition variable. Makes initial partitions from a specified variable.

■ From file. Specify the SYSTAT file where seeds are written case by case.

Linkage. Specify the linkage method for hierarchical segmentation.

Random seed. Specify the seed for random number generation.
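
Three of the simpler seeding options (First K, Last K, and Random K) can be sketched in Python as follows. The function name and the option strings are hypothetical, missing values are assumed to be coded as None, and the random draws will not match those of SYSTAT's own generator.

```python
import random

def initial_seeds(cases, k, option, seed=0):
    """cases: list of numeric lists, with None marking a missing value."""
    # Only cases without missing values are eligible as seeds.
    complete = [c for c in cases if None not in c]
    if option == "first":
        return complete[:k]
    if option == "last":
        return complete[-k:]
    if option == "random":
        # Sampling without replacement, reproducible for a given random seed.
        return random.Random(seed).sample(complete, k)
    raise ValueError("unsupported option: " + option)

cases = [[1], [None], [2], [3], [4]]
print(initial_seeds(cases, 2, "first"))  # [[1], [2]]
print(initial_seeds(cases, 2, "last"))   # [[3], [4]]
```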


Additive Trees Clustering Dialog Box 


Additive trees were developed by Sattath and Tversky (1977) for modeling 
similarity/dissimilarity data, which hierarchical joining trees do not fit well. 
Hierarchical trees imply that all within-cluster distances are smaller than all between- 
cluster distances and that within-cluster distances are equal. This so-called 
“ultrametric” condition seldom applies to real similarity data from direct judgment. 
Additive trees, on the other hand, represent similarities with a network model in the 
shape of a tree. Distances between objects are represented by the lengths of the 
branches connecting them in the tree. 


To open the Additive Trees Clustering dialog box, from the menus choose: 


Analyze
Cluster Analysis
Additive Trees...


[Figure: Cluster Analysis: Additive Trees dialog box]


At least three variables should be selected to perform Additive Tree Clustering. The 
following options can be specified: 


Data. Display the raw data matrix. 
Transformed. Include the transformed data (distance-like measures) with the output. 
Model. Display the model (tree) distances between the objects. 


Residuals. Show the differences between the distance-transformed data and the model 
distances. 


NoNumbers. Objects in the tree graph are not numbered. 


NoSubtract. Use of an additive constant. Additive Trees assumes interval-scaled data, 
which implies complete freedom in choosing an additive constant, so it adds or 
subtracts to exactly satisfy the triangle inequality. Use this NoSubtract option to allow 
strict inequality and not subtract a constant. 


Height. Prints the distance of each node from the root. 


MinVar. Combines the last few remaining clusters into the root node by searching for
the root that minimizes the variances of the distances from the root to the leaves.


Using Commands


For the Hierarchical tree method: 


CLUSTER
  USE filename
  IDVAR var$
  SAVE filename / NUMBER=n DATA
  JOIN varlist / ROWS or COLUMNS or MATRIX POLAR DISTANCE=metric
                 POWER=p COV=matrix or 'filename' GROUP=var
                 LINKAGE=method RADIUS=r K=k BETA=b MAX=n
                 VALIDITY=RMSSTD, CHF, PTS, DB, DUNN, HEIGHT=r,
                 LEAF=n, LENGTH=r PROP=r
                 SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


The distance metric is ABSOLUTE, ANDERBERG, CHISQUARE, EUCLIDEAN, GAMMA, 
JACCARD, MAHALANOBIS, MINKOWSKI, PEARSON, PERCENT, PHISQUARE, 
RSQUARED, RT, RUSSEL, SS. For MINKOWSKI, specify the root using POWER=p. For 
COV=matrix, separate columns by space and separate rows by semicolon. Use 
GROUP=var, to compute inter-group distances. 

The linkage methods include AVERAGE, CENTROID, COMPLETE, MEDIAN, SINGLE, 
KNBD, UNIFORM, FLEXIBETA, WARD and WEIGHT. 


More than one validity index can be specified at a time. 


Resampling is available only in joining columns. 


For the K-Means clustering method: 


CLUSTER
  USE filename
  IDVAR var$
  SAVE filename / NUMBER=n DATA
  KMEANS varlist / NUMBER=n ITER=n DISTANCE=metric POWER=p
                   COV=matrix or 'filename' GROUP=var
                   INITIAL=option INIFILE='filename'
                   PARTITION=var LINKAGE=method


The distance metric is ABSOLUTE, CHISQUARE, EUCLIDEAN, GAMMA,
MAHALANOBIS, MINKOWSKI, MW, PEARSON, PHISQUARE or RSQUARED. For
MINKOWSKI, specify the root using POWER=p. For COV=matrix, separate columns by
space and separate rows by semicolon. Use GROUP=var to compute inter-group
distances.

The options for initial seeds are NONE, FIRSTK, LASTK, RANDOMK, RANDSEG,
PCA and HIERSEG. Initial seeds can also be specified from a file or through a variable.
For HIERSEG, specify the linkage method using LINKAGE=method. The linkage
methods are:

AVERAGE, CENTROID, COMPLETE, MEDIAN, SINGLE, WARD, WEIGHTED.


For the K-Medians clustering method: 


CLUSTER
  USE filename
  IDVAR var$
  SAVE filename / NUMBER=n DATA or SEEDS
  KMEDIANS varlist / NUMBER=n ITER=n DISTANCE=metric POWER=p
                     COV=matrix or 'filename' GROUP=var
                     INITIAL=option INIFILE='filename'
                     PARTITION=var LINKAGE=method


The distance metric is ABSOLUTE, CHISQUARE, EUCLIDEAN, GAMMA,
MAHALANOBIS, MINKOWSKI, MW, PEARSON, PHISQUARE or RSQUARED. For
MINKOWSKI, specify the root using POWER=p. For COV=matrix, separate columns by
space and separate rows by semicolon. Use GROUP=var to compute inter-group
distances.

The options for initial seeds are NONE, FIRSTK, LASTK, RANDOMK, RANDSEG, PCA
and HIERSEG. Initial seeds can also be specified from a file or through a variable. For
HIERSEG, specify the linkage method using LINKAGE=method. The linkage methods
are:

AVERAGE, CENTROID, COMPLETE, MEDIAN, SINGLE, WARD, WEIGHTED.


For the Additive trees: 


CLUSTER
  USE filename
  ADD varlist / DATA TRANSFORMED MODEL RESIDUALS
                TREE NUMBERS NOSUBTRACT HEIGHT
                MINVAR ROOT=n1,n2


Usage Considerations


Types of data. Hierarchical Clustering works on either rectangular SYSTAT files or 
files containing a symmetric matrix, such as those produced with Correlations. 
K-Clustering works only on rectangular SYSTAT files. Additive Trees works only on 
symmetric (similarity or dissimilarity) matrices. 

Print options. PLENGTH options are effective only in Additive Trees. 

Quick Graphs. Cluster analysis includes Quick Graphs for each procedure. 
Hierarchical Clustering and Additive Trees have tree diagrams. For each cluster, 
K-Clustering displays a profile plot of the data, a parallel coordinates display and a 
display of the variable means and standard deviations. Also, K-Clustering produces a 
scatterplot matrix with different colors and symbols based on final cluster identifiers. 
To omit Quick Graphs, specify GRAPH NONE. 


Saving files. CLUSTER saves cluster indices as a new variable. 
BY groups. CLUSTER analyzes data by groups. 


Labeling output. For Hierarchical Clustering and K-Clustering, be sure to consider
using the ID Variable (on the Data menu) for labeling the output.


Examples

Example 1
K-Means Clustering

The data in the file SUBWORLD are a subset of cases and variables from the
OURWORLD file:

URBAN      Percentage of the population living in cities
BIRTH_RT   Births per 1000 people
DEATH_RT   Deaths per 1000 people
B_TO_D     Ratio of births to deaths
BABYMORT   Infant deaths during the first year per 1000 live births
GDP_CAP    Gross domestic product per capita (in U.S. dollars)
LIFEEXPM   Years of life expectancy for males
LIFEEXPF   Years of life expectancy for females
EDUC       U.S. dollars spent per person on education
HEALTH     U.S. dollars spent per person on health
MIL        U.S. dollars spent per person on the military
LITERACY   Percentage of the population who can read


The distributions of the economic variables (GDP_CAP, EDUC, HEALTH, and MIL)
are skewed with long right tails, so these variables are analyzed in log units.


This example clusters countries (cases). 


The input is: 


CLUSTER
  USE SUBWORLD
  IDVAR COUNTRY$
  LET (GDP_CAP, EDUC, MIL, HEALTH) = L10(@)
  STANDARDIZE / SD
  KMEANS URBAN BIRTH_RT DEATH_RT BABYMORT LIFEEXPM,
         LIFEEXPF GDP_CAP B_TO_D LITERACY EDUC,
         MIL HEALTH / NUMBER=4


Note that KMEANS must be specified last.


The output is:


Distance Metric is Euclidean Distance
Single Linkage Method (Nearest Neighbor)
K-Means splitting cases into 4 groups


Summary Statistics for All Cases

Variable      Between SS   df   Within SS   df   F-ratio
URBAN             18.606    3       9.394   25    16.506
BIRTH_RT          26.204    3       2.796   26    81.226
DEATH_RT          23.663    3       5.337   26    38.422
BABYMORT          26.028    3       2.972   26    75.887
LIFEEXPM          24.750    3       4.250   26    50.473
LIFEEXPF          25.927    3       3.073   26    73.122
GDP_CAP           26.959    3       2.041   26   114.447
B_TO_D            22.292    3       6.708   26    28.800
LITERACY          24.854    3       4.146   26    51.947
EDUC              25.371    3       3.629   26    60.593
MIL               24.787    3       3.213   25    64.289
HEALTH            24.923    3       3.077   25    67.488
** TOTAL **      294.362   36      50.638  309


Cluster 1 of 4 Contains 12 Cases

Members
Belgium
Denmark
France
Switzerland
WGermany
Poland
Czechoslov
Canada

Statistics
                                               Standard
Variable      Minimum      Mean   Maximum     Deviation
URBAN                               1.587         0.540
BIRTH_RT       -1.137    -0.934    -0.832         0.105
DEATH_RT       -0.770     0.000     0.257         0.346
BABYMORT       -0.852    -0.806    -0.676         0.052
LIFEEXPM        0.233     0.745     0.988         0.230
LIFEEXPF        0.430     0.793     1.065         0.182
GDP_CAP         0.333     1.014     1.275         0.257
B_TO_D         -1.092    -0.905    -0.462         0.180
LITERACY        0.540     0.721     0.747         0.059
EDUC            0.468     0.947     1.281         0.277
MIL             0.285     0.812     1.109         0.252
HEALTH          0.523     0.988     1.309         0.234


Cluster 2 of 4 Contains 5 Cases

Members
Case          Distance
Ethiopia         0.397
Guinea           0.519
Somalia          0.381
Afghanistan      0.383
Haiti            0.298

Statistics
               Standard
Variable      Deviation
URBAN             0.305
BIRTH_RT          0.102
DEATH_RT          0.757
BABYMORT          0.440
LIFEEXPM          0.557
LIFEEXPF          0.447
GDP_CAP           0.300
B_TO_D            0.258
LITERACY          0.619
EDUC              0.511
MIL               0.173
HEALTH            0.438


Cluster 3 of 4 Contains 11 Cases

Members
Case          Distance
Argentina        0.450
Brazil           0.315
Chile            0.397
Colombia         0.422
Uruguay          0.606
Ecuador          0.364
ElSalvador       0.520
Guatemala        0.646
Peru             0.369
Panama           0.514
Cuba             0.576

Statistics
Variable      Minimum      Mean
URBAN          -0.885     0.157
BIRTH_RT       -0.603     0.070
DEATH_RT       -1.284    -0.700
BABYMORT       -0.698    -0.063
LIFEEXPM       -0.628     0.057
LIFEEXPF       -0.569     0.042
GDP_CAP        -0.753    -0.382
B_TO_D         -0.651     0.630
LITERACY       -0.943     0.200
EDUC           -0.888    -0.394
MIL            -1.250    -0.591
HEALTH         -0.911    -0.474


Cluster 4 of 4 Contains 2 Cases

Members
Iraq
Libya

Statistics
                                               Standard
Variable      Minimum      Mean   Maximum     Deviation
URBAN          -0.301     0.059     0.418         0.508
BIRTH_RT        0.923     1.267     1.610         0.486
DEATH_RT       -0.770    -0.770    -0.770         0.000
BABYMORT        0.441     0.474     0.507         0.046
LIFEEXPM       -0.090    -0.036     0.018         0.076
LIFEEXPF       -0.297    -0.206    -0.115         0.128
GDP_CAP        -0.251     0.053     0.357         0.430
B_TO_D          1.608     2.012     2.417         0.573
LITERACY       -0.943    -0.857    -0.771         0.122
EDUC           -0.037     0.444     0.925         0.680
MIL             1.344     1.400     1.456         0.079
HEALTH         -0.512    -0.045     0.422         0.661


[Figure: Cluster Parallel Coordinate Plots]


[Figure: Cluster SPLOM]


For each variable, cluster analysis compares the between-cluster mean square 
(Between SS/df) to the within-cluster mean square (Within SS/df) and reports the 
F-ratio. However, do not use these F-ratios to test significance because the clusters are 
formed to characterize differences. Instead, use these statistics to characterize relative 
discrimination. For example, the log of gross domestic product (GDP_CAP) and 
BIRTH_RT are better discriminators between countries than URBAN or DEATH_RT.
For a good graphical view of the separation of the clusters, you might rotate the data 
using the three variables with the highest F-ratios. 
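
The F-ratio in the summary table is simply the ratio of the two mean squares described above. As a small check, the sketch below recomputes it for three variables from the sums of squares and degrees of freedom printed in the summary table; the function name `f_ratio` is made up for this illustration, and the results agree with the printed F-ratios only up to the rounding of the printed sums of squares.

```python
def f_ratio(between_ss, df_between, within_ss, df_within):
    """Between-cluster mean square divided by within-cluster mean square."""
    return (between_ss / df_between) / (within_ss / df_within)


# (Between SS, df, Within SS, df) copied from the K-means summary table
rows = {
    "URBAN": (18.606, 3, 9.394, 25),
    "BIRTH_RT": (26.204, 3, 2.796, 26),
    "GDP_CAP": (26.959, 3, 2.041, 26),
}
for name, args in rows.items():
    print(name, round(f_ratio(*args), 2))
```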

Following the summary statistics, for each cluster, cluster analysis prints the 
distance from each case (country) in the cluster to the center of the cluster. Descriptive 
statistics for these countries appear on the right. For the first cluster, the standard scores 
for LITERACY range from 0.54 to 0.75 with an average of 0.72. B_TO_D ranges from
-1.09 to -0.46. Thus, for these predominantly European countries, literacy is well
above the average for the sample and the birth-to-death ratio is below average. In
cluster 2, LITERACY ranges from -2.27 to -0.76 for these five countries, and B_TO_D
ranges from -0.38 to 0.25. Thus, the countries in cluster 2 have a lower literacy rate
and a greater potential for population growth than those in cluster 1. The fourth cluster 
(Iraq and Libya) has an average birth-to-death ratio of 2.01, the highest among the four 
clusters.


Cluster Parallel Coordinates


The variables in this Quick Graph are ordered by their F-ratios. In the top left plot,
there is one line for each country in cluster 1 that connects its z scores for each of the
variables. Zero marks the average for the complete sample. The lines for these 12
countries all follow a similar pattern: above average values for GDP_CAP, below
average for BIRTH_RT, and so on. The lines in cluster 3 do not follow such a tight
pattern.


Cluster Profiles 


The variables in cluster profile plots are ordered by the F-ratios. The vertical line under 
each cluster number indicates the grand mean across all data. A variable mean within 
each cluster is marked by a dot. The horizontal lines indicate one standard deviation 
above or below the mean. The countries in cluster 1 have above average means of gross
domestic product, life expectancy, literacy, and urbanization, and spend considerable 
money on health care and the military, while the means of their birth rates, infant 
mortality rates, and birth-to-death ratios are low. The opposite is true for cluster 2. 


Scatterplot Matrix 


In the scatterplot matrix (SPLOM), each off-diagonal cell is a scatterplot of two
variables and each diagonal cell is a histogram of one variable. In the off-diagonal
cells, observations belonging to the same cluster have the same color and symbol.


K-Medians Cluster Analysis with Subworld 


The input is: 


CLUSTER
  USE SUBWORLD
  IDVAR COUNTRY$
  LET (GDP_CAP, EDUC, MIL, HEALTH) = L10(@)
  STANDARDIZE / SD
  KMEDIANS URBAN BIRTH_RT DEATH_RT BABYMORT LIFEEXPM,
           LIFEEXPF GDP_CAP B_TO_D LITERACY EDUC,
           MIL HEALTH / DISTANCE=ABSOLUTE NUMBER=4


The output is:


Distance Metric is Absolute Distance
Single Linkage Method (Nearest Neighbor)
K-Medians splitting cases into 4 groups


Summary Statistics for 4 Clusters

              Within Sum of
              Absolute
Variable      Deviation
URBAN
BIRTH_RT
DEATH_RT
BABYMORT
LIFEEXPM
LIFEEXPF
GDP_CAP
B_TO_D
LITERACY
EDUC
MIL
HEALTH
** TOTAL **


Cluster 1 of 4 Contains 12 Cases

Members
Case          Distance
Austria          0.142
Belgium          0.035
Denmark          0.093
France           0.114
Switzerland      0.206
UK               0.075
Italy            0.163
Sweden           0.132
WGermany         0.130
Poland           0.384
Czechoslov       0.218
Canada           0.224

Statistics
Variable      Minimum    Median   Maximum
URBAN                     0.643     1.587
BIRTH_RT                 -0.946    -0.832
DEATH_RT                  0.257     0.257
BABYMORT                 -0.830    -0.676
LIFEEXPM                  0.772     0.988
LIFEEXPF                  0.838     1.065
GDP_CAP                   1.079     1.275
B_TO_D                   -0.949    -0.462
LITERACY                  0.747     0.747
EDUC                      0.959     1.281
MIL             0.285     0.847     1.109
HEALTH          0.523     1.007     1.309


Cluster 2 of 4 Contains 5 Cases

Members
Case          Distance
Argentina        0.169
Chile            0.102
Uruguay          0.185
Panama           0.453
Cuba             0.240

Statistics
                                    Mean Absolute
Variable       Median   Maximum         Deviation
URBAN           1.002     1.137             0.431
BIRTH_RT       -0.374     0.084             0.183
DEATH_RT       -0.770     0.000             0.411
BABYMORT       -0.479    -0.260             0.105
LIFEEXPM        0.449     0.772             0.172
LIFEEXPF        0.430     0.611             0.337
GDP_CAP        -0.216     0.037             0.345
B_TO_D         -0.102     1.554             0.867
LITERACY        0.575     0.730             0.271
EDUC           -0.175     0.135             0.487
MIL            -0.368     0.371             0.498
HEALTH         -0.243     0.284             0.525


Cluster 3 of 4 Contains 5 Cases

Members
Case          Distance
Ethiopia         0.216
Guinea           0.433
Somalia          0.266
Afghanistan      0.352
Haiti            0.202


Cluster 4 of 4 Contains 8 Cases

Members
Case          Distance
Iraq             0.585
Libya            0.659
Brazil           0.263
Colombia         0.364
Ecuador          0.160
ElSalvador       0.215
Guatemala        0.343
Peru             0.301

-2.222 


Statistics for Cluster 3
                                    Mean Absolute
Variable       Median   Maximum         Deviation
URBAN          -1.783    -1.289             0.234
BIRTH_RT        1.534     1.687             0.076
DEATH_RT        1.540     3.081             0.513
BABYMORT        1.778     2.414             0.342
LIFEEXPM       -1.814    -1.383             0.388
LIFEEXPF       -1.749    -1.477             0.657
GDP_CAP        -1.701    -1.270             0.529
B_TO_D          0.050     0.252             0.494
LITERACY       -1.978    -0.764             0.686
EDUC           -1.563    -1.096             0.649
MIL            -1.450    -1.374             0.372
HEALTH         -1.520    -1.290             0.670

Statistics for Cluster 4
                                    Mean Absolute
Variable       Median   Maximum         Deviation
URBAN          -0.031     0.418             0.511
BIRTH_RT        0.542     1.610             0.410
DEATH_RT       -0.770    -0.257             0.160
BABYMORT        0.408     0.551             0.159
LIFEEXPM       -0.305     0.233             0.229
LIFEEXPF       -0.297     0.157             0.136
GDP_CAP        -0.579     0.357             0.275
B_TO_D          1.158     2.417             0.512
LITERACY       -0.236     0.368             0.733
EDUC           -0.487     0.925             0.680
MIL            -0.838     1.456             0.960
HEALTH         -0.721     0.422             0.545


[Figure: Cluster Parallel Coordinate Plots]


[Figure: Scatter Plot Matrix (Cluster SPLOM)]


Example 2
Hierarchical Clustering: Clustering Cases


This example uses the SUBWORLD data (see the K-Means example for a description) 
to cluster cases. 


The input is: 


CLUSTER
  USE SUBWORLD
  IDVAR COUNTRY$
  LET (GDP_CAP, EDUC, MIL, HEALTH) = L10(@)
  STANDARDIZE / SD
  JOIN URBAN BIRTH_RT DEATH_RT BABYMORT LIFEEXPM,
       LIFEEXPF GDP_CAP B_TO_D LITERACY EDUC MIL HEALTH


The output is:

Distance Metric is Euclidean Distance
Single Linkage Method (Nearest Neighbor)

Clusters Joining              at Distance   No. of Members
WGermany      Belgium               0.087                2
WGermany      Denmark               0.111                3
WGermany      UK                    0.113                4
Sweden        WGermany              0.128                5
Austria       Sweden                0.161                6
Austria       France                0.194                7
Austria       Italy                 0.194                8
Austria       Canada                0.211                9
Uruguay       Argentina             0.215                2
Switzerland   Austria               0.236               10
Czechoslov    Poland                0.241                2
Switzerland   Czechoslov            0.260               12
Guatemala     ElSalvador            0.315                2
Guatemala     Ecuador               0.316                3
Uruguay       Chile                 0.370                3
Cuba          Uruguay               0.374                4
Haiti         Somalia               0.397                2
Switzerland   Cuba                  0.403               16
Guatemala     Brazil                0.417                4
Peru          Guatemala             0.421                5
Colombia      Peru                  0.443                6
Ethiopia      Haiti                 0.474                3
Panama        Colombia              0.516                7
Switzerland   Panama                0.556               23
Libya         Iraq                  0.570                2
Afghanistan   Guinea                0.583                2
Ethiopia      Afghanistan           0.597                5
Switzerland   Libya                 0.860               25
Switzerland   Ethiopia              0.908               30

[Figure: Cluster Tree; horizontal axis: Distances, 0.0 to 1.0]


The numerical results consist of the joining history. The countries at the top of the 
panel are joined first at a distance of 0.087. The last entry represents the joining of the 
largest two clusters to form one cluster of all 30 countries. Switzerland is in one of the 
clusters and Ethiopia is in the other. 

The clusters are best illustrated using a tree diagram. Because the example joins
rows (cases) and uses COUNTRY$ as an ID variable, the branches of the tree are labeled
with countries. If you join columns (variables), then variable names are used. The scale
for the joining distances is printed at the bottom. Notice that Iraq and Libya, which
form their own cluster as they did in the K-Means example, are the second-to-last
cluster to link with others. They join with all the countries listed above them at a
distance of 0.860. Finally, at a distance of 0.908, the five countries at the bottom of the
display are added to form one large cluster.
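
A joining history of this kind can be produced with a very small amount of code. The sketch below is a naive single linkage (nearest neighbor) amalgamation written for clarity rather than speed; it is not SYSTAT's JOIN algorithm, the function name is hypothetical, each cluster is simply named after its first member, and the four-object distance matrix is made up for illustration.

```python
def single_linkage_history(labels, dist):
    """Return [(label_a, label_b, distance, n_members), ...] in joining order.

    `dist` is a symmetric distance matrix given as a list of lists.
    """
    clusters = {i: [i] for i in range(len(labels))}
    history = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = keys[x], keys[y]
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        history.append((labels[a], labels[b], d, len(clusters[a])))
    return history


labels = ["P", "L", "B", "N"]
dist = [
    [0, 2, 5, 30],
    [2, 0, 6, 31],
    [5, 6, 0, 32],
    [30, 31, 32, 0],
]
for row in single_linkage_history(labels, dist):
    print(row)
```

As in the printed history, the last column is the size of the merged cluster, so the final row always contains every object.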


Polar Dendrogram 


Adding the POLAR option to JOIN yields a polar dendrogram. 


[Figure: Cluster Tree (polar dendrogram)]


Example 3
Hierarchical Clustering: Clustering Variables 


This example joins columns (variables) instead of rows (cases) to see which variables 
cluster together. 


The input is: 


CLUSTER
  USE SUBWORLD
  IDVAR COUNTRY$
  LET (GDP_CAP, EDUC, MIL, HEALTH) = L10(@)
  STANDARDIZE / SD
  JOIN URBAN BIRTH_RT DEATH_RT BABYMORT LIFEEXPM,
       LIFEEXPF GDP_CAP B_TO_D LITERACY,
       EDUC MIL HEALTH / COLUMNS DISTANCE=PEARSON


The output is: 


Distance Metric is 1-Pearson Correlation Coefficient 
Single Linkage Method (Nearest Neighbor) 


Clusters Joining          at Distance   No. of Members
LIFEEXPF    LIFEEXPM            0.011                2
HEALTH      GDP_CAP             0.028                2
EDUC        HEALTH              0.038                3
LIFEEXPF    LITERACY            0.074                3
BABYMORT    BIRTH_RT            0.077                2
EDUC        LIFEEXPF            0.102                6
MIL         EDUC                0.120                7
MIL         URBAN               0.165                8
B_TO_D      BABYMORT            0.358                3
B_TO_D      DEATH_RT            0.365                4
B_TO_D      MIL                 1.279               12


[Figure: Cluster Tree. Leaves, top to bottom: DEATH_RT, BABYMORT, BIRTH_RT, B_TO_D,
URBAN, LITERACY, LIFEEXPF, LIFEEXPM, GDP_CAP, HEALTH, EDUC, MIL; horizontal axis:
Distances, 0.0 to 1.5]


The scale at the bottom of the tree for the distance (1 - r) ranges from 0.0 to 1.5. The
smallest distance is 0.011; thus, the correlation of LIFEEXPM with LIFEEXPF is
0.989.
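
The 1 - r distance is easy to reproduce. The sketch below computes it for two columns of numbers using the usual Pearson correlation formula; the function name and the sample values are invented for the illustration. Because r lies between -1 and 1, the distance lies between 0 and 2, which is why a joining distance such as 1.279 is possible for negatively correlated groups of variables.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


print(pearson_distance([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly correlated
print(pearson_distance([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly anti-correlated
```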


Example 4 
Hierarchical Clustering: Clustering Variables and Cases 


To produce a shaded display of the original data matrix in which rows and columns are 
permuted according to an algorithm in Gruvaeus and Wainer (1972), use the MATRIX 
option. Different shadings or colors represent the magnitude of each number in the 
matrix (Ling, 1973). 

If you use the MATRIX option with Euclidean distance, be sure that the variables are 
on comparable scales because both rows and columns of the matrix are clustered. 
Joining a matrix containing inches of annual rainfall and annual growth of trees in feet, 
for example, would split columns more by scales than by covariation. In cases like this,
you should standardize your data before joining.


The input is:

CLUSTER
  USE SUBWORLD
  IDVAR COUNTRY$
  LET (GDP_CAP, EDUC, MIL, HEALTH) = L10(@)
  STANDARDIZE / SD
  JOIN URBAN BIRTH_RT DEATH_RT BABYMORT LIFEEXPM,
       LIFEEXPF GDP_CAP B_TO_D LITERACY EDUC,
       MIL HEALTH / MATRIX

The output is:


Distance Metric is Euclidean Distance 
Single Linkage Method (Nearest Neighbor) 


[Figure: Permuted Data Matrix; horizontal axis: Index of Variable]


This clustering reveals three groups of countries and two groups of variables. The 
countries with more urban dwellers and literate citizens, longest life-expectancies, 
highest gross domestic product, and most expenditures on health care, education, and 
the military are on the top left of the data matrix; countries with the highest rates of 
death, infant mortality, birth, and population growth (see B_TO_D) are on the lower 
right. You can also see that, consistent with the KMEANS and JOIN examples, Iraq 
and Libya spend much more on military, education, and health than their immediate 
neighbors.


Example 5
Hierarchical Clustering: Distance Matrix Input 


This example clusters a matrix of distances. The data, stored as a dissimilarity matrix 
in the CITIES data file, are airline distances in hundreds of miles between 10 global 
cities. The data are adapted from Hartigan (1975). 


The input is: 


CLUSTER 
USE CITIES 
JOIN BERLIN BOMBAY CAPETOWN CHICAGO LONDON, 
MONTREAL NEWYORK PARIS SANFRAN SEATTLE 


The output is: 


Single Linkage Method (Nearest Neighbor) 


Clusters Joining at Distance No. of Members 
PARIS LONDON 2.000 2 
NEWYORK MONTREAL 3.000 2 
BERLIN PARIS 5.000 3 
CHICAGO NEWYORK 7.000 3 
SEATTLE SANFRAN 7.000 2 
SEATTLE CHICAGO 17.000 5 
BERLIN SEATTLE 33.000 8 
BOMBAY BERLIN 39.000 9 
BOMBAY CAPETOWN 51.000 10 

[Figure: Cluster Tree. Leaves, top to bottom: SANFRAN, SEATTLE, CHICAGO, NEWYORK,
MONTREAL, LONDON, PARIS, BERLIN, BOMBAY, CAPETOWN]


The tree is printed in seriation order. Imagine a trip around the globe to these cities.
SYSTAT has identified the shortest path between cities. The itinerary begins at San 
Francisco, leads to Seattle, Chicago, New York, and so on, and ends in Capetown. 

Note that the CITIES data file contains the distances between the cities; SYSTAT 
did not have to compute those distances. When you save the file, be sure to save it as 
a dissimilarity matrix. 

This example is used both to illustrate direct distance input and to give you an idea 
of the kind of information contained in the order of the SYSTAT cluster tree. For 
distance data, the seriation reveals shortest paths; for typical sample data, the seriation 
is more likely to replicate in new samples so that you can recognize cluster structure. 


Example 6 
Density Clustering Examples 


K-th Nearest Neighbor Density Linkage Clustering 


The data file CARS is used for analysis of Hierarchical Clustering using K-th Nearest 
Neighbor density linkage clustering. 


The variables in the CARS data that are used for analysis are ACCEL, BRAKE,
SLALOM, MPG and SPEED.


The input is: 


CLUSTER
  USE CARS
  IDVAR NAME$
  STANDARDIZE ACCEL BRAKE SLALOM MPG SPEED
  JOIN ACCEL BRAKE SLALOM MPG SPEED / LINKAGE=KNBD K=3


The output is: 


Distance Metric is Euclidean Distance 
KNBD Density Linkage Method for K = 3 


Clusters Joining                at Distance   No. of Members
BMW 635        Saab 9000
VW Fox GL      Chevy Nova
VW Fox GL      Civic CRX
                                       0.91                2
Toyota Supra   BMW 635                0.914                3
Testarossa     Porsche 911T           2.715                2
Corvette       Testarossa             2.715                3
Mercedes 560   Toyota Supra           2.808                4
Mercedes 560   Acura Legend           2.274                5
Corvette       Mercedes 560           3.309                8
VW Fox GL      Corvette               7.320                9


[Figure: Cluster Tree. Leaves, top to bottom: Civic CRX, Mercedes 560, Saab 9000,
BMW 635, Toyota Supra, Acura Legend, Corvette, Porsche 911T, Testarossa, VW Fox GL,
Chevy Nova; horizontal axis: Distances, 0 to 40]


Uniform Kernel Density Linkage Clustering 


The data file CARS is used for analysis of Hierarchical Clustering using Uniform 
Kernel density linkage clustering. 


The variables in the CARS data that are used for analysis are ACCEL, BRAKE, SLALOM,
MPG and SPEED.


The input is: 


CLUSTER
  USE CARS
  IDVAR NAME$
  STANDARDIZE ACCEL BRAKE SLALOM MPG SPEED
  JOIN ACCEL BRAKE SLALOM MPG SPEED / LINKAGE=UNIFORM RADIUS=1.2


The output is:


Distance Metric is Euclidean Distance 
Uniform Density Linkage Method for Radius = 1.200 


Clusters Joining                at Distance   No. of Members
BMW 635        Toyota Supra          18.010                2
BMW 635        Porsche 911T          19.296                3
Saab 9000      BMW 635               19.296                4
Acura Legend   Saab 9000             19.296                5
Acura Legend   Corvette              21.011                6
VW Fox GL      Acura Legend          21.011                7
VW Fox GL      Mercedes 560          23.413                8
VW Fox GL      Testarossa            34.304                9
Civic CRX      VW Fox GL             34.304               10
Chevy Nova     Civic CRX             34.304               11


[Figure: Cluster Tree. Leaves, top to bottom: Testarossa, Mercedes 560, Acura Legend,
Saab 9000, BMW 635, Toyota Supra, Porsche 911T, Corvette, VW Fox GL, Civic CRX,
Chevy Nova; horizontal axis: Distances]


Example 7
Flexible Beta Linkage Method for Hierarchical Clustering 


The data file CARS is used for the analysis of Hierarchical Clustering using Flexible 
beta linkage clustering. 

The variables in CARS data which are used for analysis are ACCEL, BRAKE, 
SLALOM, MPG and SPEED. 


The input is: 


CLUSTER 
USE CARS 
IDVAR NAME$
STANDARDIZE ACCEL BRAKE SLALOM MPG SPEED 
JOIN ACCEL BRAKE SLALOM MPG SPEED/LINKAGE=FLEXIBETA BETA=-0.25 


The output is: 


Distance Metric is Euclidean Distance 
Flexible Beta Linkage Method for Beta = -0.250 


Clusters Joining at Distance No. of Members 
Corvette Porsche 911T 0.373 2 
BMW 635 Saab 9000 0.392 2 
Toyota Supra BMW 635 0.563 3 
Corvette Testarossa 0.746 3 
Mercedes 560 Toyota Supra 1.013 4 
Chevy Nova Acura Legend 1.038 2 
VW Fox GL Civic CRX 1.161 2 
VW Fox GL Chevy Nova 1.339 4 
Corvette Mercedes 560 1.842 7 
VW Fox GL Corvette 2.997 11 


[Figure: Cluster Tree]


Example 8
Validity indices RMSSTD, Pseudo F, and Pseudo T-square with cities


This example uses the CITIES data file to compute the RMSSTD, pseudo F, and
pseudo T-square validity indices for hierarchical clustering.

These indices suggest how many clusters give a good partition of the given data in
hierarchical clustering.
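
To make the pseudo F index concrete, the sketch below computes it for a set of labeled points using the common Calinski-Harabasz form, (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)). This is an assumption about the definition rather than a description of SYSTAT's code; in particular, SYSTAT works here from a distance matrix, whereas the sketch starts from raw coordinates. The function name and the data are hypothetical.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo F for points (lists of floats) and labels."""
    n = len(points)
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("pseudo F needs 1 < k < n")

    def centroid(ps):
        return [sum(c) / len(ps) for c in zip(*ps)]

    def sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = centroid(points)
    # Within SS: squared distances of points to their own cluster centroid.
    within = sum(sq(p, centroid(ps)) for ps in groups.values() for p in ps)
    # Between SS: size-weighted squared distances of centroids to the grand mean.
    between = sum(len(ps) * sq(centroid(ps), grand) for ps in groups.values())
    return (between / (k - 1)) / (within / (n - k))


points = [[0.0], [2.0], [10.0], [12.0]]
print(pseudo_f(points, [0, 0, 1, 1]))
```

Large values indicate compact, well-separated clusters, so a peak in pseudo F across candidate numbers of clusters is read as a suggestion for the number of clusters.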


The input is: 


CLUSTER 
USE CITIES 
JOIN/LINKAGE=CENTROID VALIDITY = RMSSTD CHF PTS MAX=9 


The output is: 
Centroid Linkage Method 

Clusters Joining        at Distance   No. of Members   RMSSTD   Pseudo F   Pseudo T-square
PARIS      LONDON             2.000                2    1.000     25.350
NEWYORK    MONTREAL           3.000                2    1.225     23.006
BERLIN     PARIS              5.000                3    1.472     16.969
CHICAGO    NEWYORK            6.750                3    1.732     14.978
SEATTLE    SANFRAN            7.000                2    1.871     17.166
SEATTLE    CHICAGO           18.583                5    2.820      9.280
BERLIN     SEATTLE           35.929                8    3.845      3.392
CAPETOWN   BOMBAY            51.000                2    5.050      4.639
CAPETOWN   BERLIN            46.750               10    4.759


[Figure: Cluster Tree]


[Figure: Validity Index Plot; horizontal axis: Number of Clusters]


We observe that there is a "knee" in the RMSSTD plot at 5, a jump in the plot of the
pseudo T-square also at 5, and a peak at the same point in the graph of pseudo F. Hence
the appropriate number of clusters appears to be 5. In some data sets the indices may
not all point to the same clustering; you must then choose the appropriate clustering
scheme based on the type of data.


Example 9
Hierarchical Clustering with Leaf Option


In this example we use the IRIS data file to illustrate the LEAF option in hierarchical
clustering. The IRIS data file contains the following variables: SPECIES, SEPALLEN,
SEPALWID, PETALLEN and PETALWID. It becomes difficult to understand the
substructures of the data from the cluster trees when there are a large number of
objects. In such cases, the LEAF option helps the user to concentrate on the upper part
of the tree. In the following example with LEAF=13, SYSTAT provides another tree
with 13 leaf nodes along with a partition table. The table shows the content of each
node.


The input is: 


CLUSTER 
USE IRIS 
JOIN SEPALLEN SEPALWID PETALLEN PETALWID/LINKAGE=WARD LEAF=13 


The following is a part of the output: 


Distance Metric is Euclidean Distance 
Ward Minimum Variance Method 


[Figure: Cluster Tree; horizontal axis: Distances, 0 to 100]


Cluster Tree and Partition Table for LEAF = 13


[Figure: Cluster Tree with 13 leaf nodes; horizontal axis: Distances, 0 to 100]


[Table: partition table with columns Node1 to Node13]


Example 10
Additive Trees


These data are adapted from an experiment by Rothkopf (1957) in which 598 subjects
were asked to judge whether Morse code signals presented two in succession were the
same. All possible ordered pairs were tested. For multidimensional scaling, the data for
letter signals are averaged across the sequence and the diagonal (pairs of the same
signal) is omitted. The variables are A through Z.


The input is: 
CLUSTER
  USE ROTHKOPF
  ADD A .. Z


The output is:


Similarities linearly transformed into distances 
77.000 needed to make distances positive 

104.000 added to satisfy triangle inequality 
Checking 14950 quadruples 

Checking 1001 quadruples 

Checking 330 quadruples 

Checking 70 quadruples 

Checking 1 quadruples 


Stress Formula 1 : 0.061 
Stress Formula 2 : 0.399 
R-squared (Monotonic) : 0.841 


R-squared (p.v.a.f.) : 0.788


Node   Length   Child
   1   23.396   A
   2   15.396   B
   3   14.813   C
   4   13.313   D
   5   24.125   E
   6   34.837   F
   7   15.917   G
   8   27.875   H
   9   25.604   I
  10   19.833   J
  11   13.688   K
  12   28.620   L
  13   21.813   M
  14   22.188   N
  15   19.083   O
  16   14.167   P
  17   18.958   Q
  18   21.438   R
  19   28.000   S
  20   23.875   T
  21   23.000   U
  22   27.125   V
  23            W
  24            X
  25            Y
  26            Z
  27            1,
  28            2,
  29   15.716   3,25
  30   19.583   4,11
  31   26.063   5,20
  32   23.843   7,15
  33    6.114   8,22
  34   17.175   10,16
  35   18.807   13,14
  36   13.784   17,26
  37   15.663   18,23
  38    8.886   19,21
  39    4.562   27,35
  40    1.700   29,36
  41    8.799   33,38
  42    4.180   39,31
  43    1.123   12,28
  44    5.049   34,40
  45    2.467   42,41
  46    4.585   30,43
  47    2.616   32,44
  48    2.730   6,37
  49    0.000   45,48
  50    3.864   46,47
  51    0.000   50,49

(SYSTAT also displays the raw data, as well as the model distances.)
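
A node table of this kind determines the model distances: the distance between two objects is the sum of the branch lengths on the path that connects them. The sketch below shows the computation for a small made-up tree with three leaves. It assumes that each node's length is the length of the branch joining it to its parent, which is an interpretation of the table rather than something the output states; the function name and the numbers are hypothetical.

```python
def tree_distance(a, b, length, children):
    """Sum of branch lengths on the path between nodes a and b."""
    parent = {c: p for p, kids in children.items() for c in kids}

    def path_to_root(node):
        # Map each ancestor (and the node itself) to the branch length
        # accumulated on the way up to it.
        acc, out = 0.0, {}
        while True:
            out[node] = acc
            if node not in parent:
                return out
            acc += length[node]
            node = parent[node]

    up_a, up_b = path_to_root(a), path_to_root(b)
    # With non-negative lengths, the lowest common ancestor gives the minimum.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Leaves 1-3, internal node 4 = (1, 2), root 5 = (4, 3); made-up lengths.
length = {1: 1.0, 2: 2.0, 3: 4.0, 4: 3.0, 5: 0.0}
children = {4: (1, 2), 5: (4, 3)}
print(tree_distance(1, 2, length, children))
print(tree_distance(1, 3, length, children))
```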


[Figure: Additive Tree]


Computation


Algorithms 


JOIN follows the standard hierarchical amalgamation method described in Hartigan 
(1975). The algorithm in Gruvaeus and Wainer (1972) is used to order the tree. The 
Kth-Nearest Neighborhood method and the Uniform Kernel method use the algorithm
prescribed in Wong and Lane (1983). 

KMEANS follows the algorithm described in Hartigan (1975). Its speed can be 
improved using modifications proposed by Hartigan and Wong (1979). There is an 
important difference between SYSTAT's KMEANS algorithm and implementations of 
Hartigan's algorithm in BMDP, SAS, and SPSS: in SYSTAT, by default, seeds for 
new clusters are chosen by finding the case farthest from the centroid of its cluster; in 
Hartigan's algorithm, seeds forming new clusters are chosen by splitting on the 
variable with largest variance. KMEDIANS essentially follows the same algorithm but 
uses the median instead of the mean. The median is determined by a modification of 
binary search. 
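
The default seeding rule described above can be illustrated with a short Python sketch. This is not SYSTAT's code: the function names and the toy clusters are hypothetical, and Euclidean distance is assumed.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def farthest_from_centroid(clusters):
    """Return the case lying farthest from the centroid of its own cluster.

    `clusters` is a list of clusters, each a non-empty list of coordinate tuples.
    The returned case would serve as the seed for a new cluster.
    """
    best_case, best_dist = None, -1.0
    for members in clusters:
        c = centroid(members)
        for case in members:
            d = math.dist(case, c)  # Euclidean distance (assumed metric)
            if d > best_dist:
                best_case, best_dist = case, d
    return best_case

# Hypothetical data: two clusters on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0), (8.0, 0.0)], [(20.0, 0.0), (21.0, 0.0)]]
seed = farthest_from_centroid(clusters)
print(seed)
```

Splitting on the variable with the largest variance, as in Hartigan's algorithm, would instead choose a cut point on one variable rather than a single outlying case.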


Missing Data 


In cluster analysis, all distances are computed with pairwise deletion of missing values. 
Since missing data are excluded from distance calculations by pairwise deletion, they 
do not directly influence clustering when you use the MATRIX option for JOIN. To use 
the MATRIX display to analyze patterns of missing data, create a new file in which 
missing values are recoded to 1, and all other values to 0. Then use JOIN with MATRIX 
to see whether missing values cluster together in a systematic pattern. 
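
As a rough illustration of both ideas, the following Python sketch computes a distance with pairwise deletion and builds the 0/1 missing-value indicator file. It is a sketch under stated assumptions: `None` stands for a missing value, the function names are hypothetical, and dividing by the number of usable pairs is an assumed normalization rather than SYSTAT's documented formula.

```python
import math

def pairwise_distance(x, y):
    """Root-mean-square difference over coordinates present in both cases.

    Coordinates where either value is None are skipped (pairwise deletion).
    Returns None when the two cases share no observed coordinate.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

def missing_indicator(rows):
    """Recode each value to 1 if missing and 0 otherwise."""
    return [[1 if v is None else 0 for v in row] for row in rows]

# Hypothetical cases with missing values.
rows = [(1.0, None, 4.0), (4.0, 2.0, None)]
print(pairwise_distance(rows[0], rows[1]))
print(missing_indicator(rows))
```

Clustering the indicator matrix, rather than the data, is what reveals whether values tend to be missing together.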
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Chapter 


Conjoint Analysis 


Leland Wilkinson 


Conjoint analysis fits metric and nonmetric conjoint measurement models to observed 
data. It is designed to be a general additive model program using a simple 
optimization procedure. As such, conjoint analysis can handle measurement models 
not normally amenable to other specialized conjoint programs. 

Resampling procedures are available in this feature. 


Statistical Background 


Conjoint measurement (Luce and Tukey, 1964; Krantz, 1964; Luce, 1966; Tversky, 
1967; Krantz and Tversky, 1971) is an axiomatic theory of measurement that defines 
the conditions under which there exist measurement scales for two or more variables 
that jointly define a common scale under an additive composition rule. This theory 
became the basis for a group of related numerical techniques for fitting additive 
models, called conjoint analysis (Green and Rao, 1971; Green et al., 1972; Green and 
DeSarbo, 1978; Green and Srinivasan, 1978, 1990; Louviere, 1988, 1994). For an 
interesting historical comment on Sir Ronald Fisher’s “appropriate scores” method 
for fitting additive models, see Heiser and Meulman (1995). 

To see how conjoint analysis is based on additive models, we will first graph an 
additive table and then examine a multiplicative table to encounter one example of a 
non-additive table. Then we will consider the problem of computing margins of a 
general table based on an additive model. 


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Additive Tables 


The following is an additive table. Notice that any cell (in roman) is the sum of the 
corresponding row and column marginal values (in italic). 


       1    2    3
  4    5    6    7
  3    4    5    6
  2    3    4    5
  1    2    3    4


A common way to represent a two-way table like this is with a graph. We make a file 
(PCONJ.SYZ) containing all possible ordered pairs of the row and column indices. Then 
we form Y values by adding the indices. 


The input is: 


USE PCONJ 
LET Y=A+B 
LINE Y*A/GROUP=B, OVERLAY 


The following graph of the additive table shows a plot of Y (the values in the cells) 
against A (rows) stratified by B (columns) in the legend. Notice that the lines are 
parallel. 
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The output is: 


[Line graph: Y (vertical axis) against A (horizontal axis), with one line for each level of B]


Since we really have a three-dimensional graph (Y*A*B), it is sometimes convenient 
to represent a two-way table as a 3-D or contour plot rather than as a stratified line 
graph. 


The input is: 


PLOT Y*A*B/SMOOTH=QUAD, CONTOUR, 
     XMIN=0, XMAX=4, YMIN=0, YMAX=5, TICK=INDENT 


The following contour plot of the additive table shows the result. Notice that the lines 
in the contour plot are parallel for additive tables. Furthermore, although we used a 
quadratic smoother, the contours are linear because we used a simple linear 
combination of A and B to make Y. 
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The output is: 


Multiplicative Tables 


The following is a multiplicative table. Notice that any cell is the product of the 
corresponding marginal values (in italic). We commonly encounter these tables in 
cookbooks (for sizing recipes) or in, well, multiplication tables. These tables are one 
instance of two-way tables that are not additive. 


       1    2    3
  4    4    8   12
  3    3    6    9
  2    2    4    6
  1    1    2    3


Let us look at a graph of this multiplicative table: 


LET Y=A*B 
LINE Y*A/GROUP=B, OVERLAY 


Notice that the lines are not parallel. 
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And the following figure shows the contour plot for the multiplicative model. Notice, 
again, that the contours are not parallel. 


[Contour plot: Y as a function of A and B for the multiplicative table]


Multiplicative tables and graphs may be pleasing to look at, but they are not simple. 
We all learned to add before multiplying. Scientists often simplify multiplicative 
functions by logging them, since logs of products are sums of logs. This is also one of 
the reasons we are told to be suspicious of fan-fold interactions (as in the line graph of 
the multiplicative table) in the analysis of variance. If we can log the variables and 
remove them (usually improving the residuals in the process), we should do so because 
it leaves us with a simple linear model. 
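
The effect of logging a multiplicative table can be checked with a small Python sketch. The margins 1 to 4 and 1 to 3 match the tables in this section; the helper name `is_additive` and the tolerance are assumptions made for illustration. A two-way table is additive exactly when every interaction contrast against the first row and column is zero.

```python
import math

rows = [1, 2, 3, 4]   # row margins (A)
cols = [1, 2, 3]      # column margins (B)

# Multiplicative table: each cell is the product of its margins.
table = [[a * b for b in cols] for a in rows]
# Logged table: log(a*b) = log(a) + log(b), so this one should be additive.
logged = [[math.log(v) for v in row] for row in table]

def is_additive(t, tol=1e-9):
    """True if t[i][j] - t[i][0] - t[0][j] + t[0][0] is zero for every cell."""
    return all(
        abs(t[i][j] - t[i][0] - t[0][j] + t[0][0]) <= tol
        for i in range(len(t))
        for j in range(len(t[0]))
    )

print(is_additive(table), is_additive(logged))
```

The parallel lines in the line graph of an additive table are the graphical counterpart of these zero contrasts.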


Computing Table Margins Based on an Additive Model 


If we believe in Occam’s razor and assume that additive tables are generally preferable 
to non-additive, we may want to fit additive models to a table of numbers before 
accepting a more complex model. So far, we have been assuming that the marginal 
indices are known. Testing for additivity is simply a matter of using these indices in a 
formal model. What if the marginal indices are not known? All we have is a table of 
numbers bordered by labeled categories. Can we find marginal values such that a linear 
model based on these values would reproduce the table? 

This is exactly what conjoint analysis does. Conjoint analysis originated in an 
axiomatic approach to measurement (Luce and Tukey, 1964). An additive model 
underlies a basic axiom of “fundamental measurement”: scale values of separate 
measurements can be added to produce a joint measurement. This powerful property 
allows us to say that, for all measurements a and b that we have made on a set of objects, 
(a + b) > a and (a + b) > b, assuming that a and b are positive. 

The following table is an example of such data. How do we find values for $a_i$ and 
$b_j$ such that $y_{ij} = a_i + b_j$? Luce and Tukey devised rules for computing these values 
assuming that the cell values can be fit by the additive model. 


          b1      b2      b3
  a4    1.38    2.07    2.48
  a3    1.10    1.79    2.20
  a2     .69    1.38    1.79
  a1     .00     .69    1.10


The following figure shows a solution. The values for a are $a_1 = 0.00$, $a_2 = 0.69$, 
$a_3 = 1.10$, and $a_4 = 1.38$. The values for b are $b_1 = 0.00$, $b_2 = 0.69$, and 
$b_3 = 1.10$. 
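
For a table that is exactly additive, the marginal values can be recovered by simple averaging. The Python sketch below does this for the table above. It is not Luce and Tukey's procedure and not the CONJOINT algorithm: the function name `recover_margins` is hypothetical, and fixing $a_1 = b_1 = 0$ is an assumed identification, since the margins are otherwise determined only up to a constant.

```python
# Cell values y[i][j] from the table above, listed from a1 (first row) to a4.
y = [
    [0.00, 0.69, 1.10],  # a1
    [0.69, 1.38, 1.79],  # a2
    [1.10, 1.79, 2.20],  # a3
    [1.38, 2.07, 2.48],  # a4
]

def recover_margins(table):
    """Estimate row and column values with the first row and column fixed at 0.

    a_i is the average difference between row i and row 1;
    b_j is the average difference between column j and column 1.
    """
    n_rows, n_cols = len(table), len(table[0])
    a = [sum(table[i][j] - table[0][j] for j in range(n_cols)) / n_cols
         for i in range(n_rows)]
    b = [sum(table[i][j] - table[i][0] for i in range(n_rows)) / n_rows
         for j in range(n_cols)]
    return a, b

a, b = recover_margins(y)
print([round(v, 2) for v in a])
print([round(v, 2) for v in b])
```

The recovered values are the logs of 1, 2, 3, 4 and 1, 2, 3 to two decimals, so this table is the logged multiplicative table from the previous section.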
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Applied Conjoint Analysis 


In the last few decades, conjoint analysis has become popular, especially among 
market researchers and some economists, for analyzing consumer preferences for 
goods based on multiple attributes. Green and Srinivasan (1978, 1990), Crowe (1980), 
and Louviere (1988) summarize this activity. The focus of most of these techniques has 
been on the development of products with attributes ideally suited to consumer 
preferences. Several trends in this area have been apparent. 

First, psychometricians decided that the axiomatic approach was impractical for 
large data sets and for data in which the conjoint measurement axioms were violated 
or contained errors (for example, Emery and Barron, 1979). This trend was partly a 
consequence of the development of numerical methods that could fit conjoint models 
nonmetrically (Kruskal, 1965; Kruskal and Carmone, 1969; Srinivasan and Shocker, 
1973; De Leeuw et al., 1976). Green and Srinivasan (1978) coined the term conjoint 
analysis for the application of these numerical methods. 

Second, applied researchers began to substitute linear methods (usually least- 
squares linear regression or ANOVA) for nonmetric algorithms. The justification for 
this was usually practical—the results appeared to be similar for all of the fitting 
methods, so why not use the simple linear ones? Louviere (1988) articulates this 
position, partly based on results from Green and Srinivasan (1978) and partly from his 
own experience with real data sets. This argument is similar to one made by Weeks and 
Bentler (1979), in which multidimensional scalings using a linear distance function 
produced configurations almost indistinguishable from those using monotonic or 
moderately nonlinear distance functions. This is a rather ad hoc conclusion, however, 
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and does not justify ignoring possible nonlinearities in the modeling process. We will 
look at such a case in the examples. 

Third, recent conjoint analysis applied methodology has moved toward designing 
experiments rather than analyzing received ratings. Green and Srinivasan (1990) and 
Louviere (1991) have pioneered this approach. Response surfaces for fractional 
designs are analyzed to identify optimal combinations of product features. In 
SYSTAT, this approach amounts to using DESIGN for setting up an experimental 
design and then GLM for analyzing the results. With PLENGTH LONG, least-squares 
means are produced for factorial designs. Otherwise, response surfaces can be plotted. 

Fourth, discrete choice logistic regression has recently emerged as a rival to conjoint 
analysis for modeling choice and preference behavior (Hensher and Johnson, 1981). 
Steinberg (1992) describes the advantages and limitations of this approach. The LOGIT 
procedure in SYSTAT offers this method. 

Finally, a commercial industry supplying the practical tools for conjoint studies has 
produced a variety of software packages. Oppewal (1995) reviews some of these. In 
many cases, more efforts are devoted to “card decks” and other stimulus materials 
management than to the actual analysis of the models. CONJOINT in SYSTAT 
represents the opposite end of the spectrum from these approaches. CONJOINT presents 
methods for fitting these models that are inspired more by Luce and Tukey’s and Green 
and Rao’s original theoretical formulations than by the practical requirements of data 
collection. The primary goal of SYSTAT CONJOINT is to provide tools for scaling 
small- to moderate-sized data sets in which additive models can simplify the 
presentation of data. Metric and nonmetric loss functions are available for exploring 
the effects of nonlinearity on scaling. The examples highlight this distinction. 
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Conjoint Analysis in SYSTAT 


Conjoint Analysis Dialog Box 


To open the Conjoint Analysis dialog box, from the menus choose: 


Advanced 
Conjoint Analysis... 


[Conjoint Analysis dialog box, Model tab. Controls: Available variable(s), Dependent(s), Independent(s), Iterations, Convergence, Polarity (Positive or Negative), Save estimates]


Model selection and estimation are available in the Model tab of the Conjoint Analysis 
dialog box. 
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Dependent(s). Select the variable(s) you want to examine. The dependent variable(s) 
should be continuous numeric variables (for example, INCOME). 


Independent(s). Select one or more continuous or categorical variables (grouping 
variables). 


Iterations. Enter the maximum number of iterations. If not stated, the maximum is 50. 


Convergence. Enter the relative change in estimates—if all such changes are less than 
the specified value, convergence is assumed. 


Polarity. Enter the polarity of the preferences when doing preference mapping. If the 
smaller number indicates the least and the higher number the most, select Positive. For 
example, a questionnaire may include the question “please rate a list of movies where 
one star is the worst and five stars is the best.” If the higher number indicates a lower 
ranking and the lower number indicates a higher ranking, select Negative. For 
example, a questionnaire may include the question “please rank your favorite sports 
team where 1 is the best and 10 is the worst.” 

Loss. Specify a loss function to apply in model estimation: 

■ Stress. Conjoint analysis minimizes Kruskal's STRESS. 
■ Tau. Conjoint analysis maximizes Kendall's tau-b. 

Regression. Specify the regression form: 

■ Monotonic. Regression function is monotonically increasing or decreasing. If 
LOSS=STRESS, this is Kruskal’s MONANOVA model. 
■ Linear. Regression function is ordinary linear regression. 
■ Log. Regression function is logarithmic. 
■ Power. Regression function is of the form $y = ax^b$. This is useful for Box-Cox 
models. 


Save estimates. Saves parameter estimates into a file. 
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Using Commands 


To request a conjoint analysis: 


CONJOINT 
MODEL depvarlist = indvarlist 
ESTIMATE / ITERATIONS = n , 
           CONVERGENCE = d , 
           LOSS = STRESS 
                  TAU , 
           REGRESSION = MONOTONIC 
                        LINEAR 
                        LOG 
                        POWER , 
           POLARITY = POSITIVE 
                      NEGATIVE , 
           SAMPLE = BOOT(m,n) or SIMPLE(m,n) or JACK 


Usage Considerations 


Types of data. CONJOINT uses rectangular data only. 
Print options. The output is standard for all PLENGTH options. 


Quick Graphs. Quick Graphs produced by CONJOINT are utility functions for each 
predictor variable in the model. 


Saving files. CONJOINT saves parameter estimates as one case into a file if you precede 
ESTIMATE with SAVE. 


BY groups. CONJOINT analyzes data by groups. Your file need not be sorted on the BY 
variable(s). 


Case frequencies. FREQUENCY <variable> increases the number of cases by the 
FREQUENCY variable. 


Case weights. WEIGHT is not available in CONJOINT. 
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Examples 


Example 1 
Choice Data 


The classical application of conjoint analysis is to product choice. The following 
example from Green and Rao (1971) shows how to fit a nonmetric conjoint model to 
some typical choice data. 


The input is: 


CONJOINT 
USE BRANDS 
MODEL RESPONSE=DESIGN$..GUARANT$ 
ESTIMATE / POLARITY=NEGATIVE 


The output is: 

Iterative Conjoint Analysis 

Monotonic Regression Model 

Data are ranks 

Loss Function is Kruskal STRESS 

Factors and Levels 

DESIGN$   BRAND$    PRICE   SEAL$   GUARANT$ 

A         Bissell   1.19    NO      NO 
B         Glory     1.39    YES     YES 
C         K2R       1.59 


Iteration History 

Convergence Criterion : 0.000010 
Maximum Iterations    : 50 

                         Max Parameter 
Iteration      Loss         Change 
    1        0.538908      0.264175 
    2        0.447639      0.271101 
    3        0.317082      0.248250 
    4        0.174664      0.329063 
    5        0.128528      0.170226 
    6        0.105073      0.190633 
    7        0.087771      0.126196 
    8        0.059169      0.233653 
    9        0.040701      0.166551 
   10        0.016657      0.144876 
   11        0.010140      0.139994 
   12        0.005824      0.204832 
   13        0.001359      0.190077 
   14        0.000632      0.034504 
   15        0.000116      0.046652 
   16        0.000007      0.019244 
   17        0.000000      0.015517 
   18        0.000000      0.003273 
   19        0.000000      0.000003 
   20        0.000000      0.000000 

Parameter Estimates (Part Worth's) 

        A          B          C    Bissell      Glory        K2R   PRICE(1) 
-0.330516   0.399756   0.208648  -0.122195  -0.226345  -0.194550   0.301647 

 PRICE(2)   PRICE(3)         NO        YES         NO        YES 
 0.159020  -0.428528  -0.130801  -0.101880  -0.038737   0.504483 

Goodness of Fit (Kendall tau) 

 RESPONSE 
 1.000000 

RMS Deleted Goodness of Fit Value, i.e. Fit when Parameter(i)=0 

        A          B          C    Bissell      Glory        K2R   PRICE(1) 
 0.856209   0.699346   0.934641   0.921569   0.843137   0.856209   0.777778 

 PRICE(2)   PRICE(3)         NO        YES         NO        YES 
 0.921569   0.816993   0.947712   0.973856   0.986928   0.790850 

[Figure: Shepard diagram and profile plots, including panels for DESIGN$ and PRICE]
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The fitting method chosen for this example is the default nonmetric loss using 
Kruskal’s STRESS statistic. This is the same method used in the MONANOVA 
program (Kruskal and Carmone, 1969). Although the minimization algorithm differs 
from that program, the result should be comparable. The iterations converged to a 
perfect fit (LOSS = 0). That is, there exists a set of parameter estimates such that their 
sums fit the observed data perfectly when Kendall's tau-b is used to measure fit. This 
rarely occurs with real data. 
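
To make the STRESS criterion concrete, here is a minimal Python sketch of monotonic regression by the pool-adjacent-violators method, followed by a STRESS-type loss. It is a sketch under stated assumptions: the function names are hypothetical, the joint scores are assumed to be listed in the order of the observed data, and the normalization follows Kruskal's formula 1, which may differ from what CONJOINT uses internally.

```python
import math

def monotone_fit(values):
    """Least-squares nondecreasing fit to `values` (pool adjacent violators)."""
    blocks = []  # each block holds [sum, count] of the values pooled so far
    for v in values:
        blocks.append([v, 1])
        # Merge backward while a block mean exceeds the mean of the next block.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

def stress(scores, fitted):
    """Square root of the residual sum of squares over the sum of squared scores."""
    num = sum((s - f) ** 2 for s, f in zip(scores, fitted))
    den = sum(s ** 2 for s in scores)
    return math.sqrt(num / den)

# Hypothetical joint scores, ordered by the observed data values.
scores = [1.0, 3.0, 2.0, 4.0]
fitted = monotone_fit(scores)
print(fitted, stress(scores, fitted))
```

A loss of zero, as in this example's final iterations, means the joint scores are already in the same order as the data, so the monotonic fit reproduces them exactly.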

The parameter estimates are scaled to have zero sum and unit sum of squares. There 
is a single goodness-of-fit value for this example because there is one response. The 
root-mean-square deleted goodness-of-fit values are the goodness of fit when each 
respective parameter is set to zero. This serves as an informal test of sensitivity. The 
lowest value for this example is for the B parameter, indicating that the estimate for B 
cannot be changed without substantially affecting the overall goodness of fit. 

The Shepard diagram displays the goodness of fit in a scatterplot. The Data axis 
represents the observed data values. The Joint Score axis represents the values of the 
combined parameter estimates. For example, if we have parameters a1, a2, a3 and b1, 
b2, then every case measured on, say, a2 and b1 will be represented by a point in the 
plot whose ordinate (y value) is a2 + b1. This example involves only one condition per 
“card” or case, so that the Shepard diagram has no duplicate values on the y axis. 
Conjoint analysis can easily handle duplicate measurements either with multiple 
dependent variables (multiple subjects exposed to common stimuli) or with duplicate 
values for the same subject (replications). 

The fitted jagged line is the best fitting monotonic regression of these fitted values 
on the observed data. For a similar diagram, see the chapter “Multidimensional 
Scaling” on page 185 in Statistics III, and note carefully the warnings about 
“degenerate” solutions and other problems. 

You may want to try this example with REGRESSION = LINEAR to see how the 
results compare. The linear fit yields an almost perfect Pearson correlation. This also 
means that GLM can produce nearly the same estimates. 


The input is: 

GLM 
MODEL RESPONSE = CONSTANT + DESIGN$..GUARANT$ 
CATEGORY DESIGN$..GUARANT$ 
PLENGTH LONG 
ESTIMATE 


The PLENGTH LONG statement causes GLM to print the least-squares estimates of the 
marginal means that, for an additive model, are the parameters we seek. The GLM 
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parameter estimates will differ from the ones printed here only by a constant and 
scaling parameter. Conjoint analysis always scales parameter estimates to have zero 
sum and unit sum of squares. This way, they can be thought of as utilities over the 
experimental domain—some negative, some positive. 
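
The scaling convention is easy to reproduce. The following Python sketch rescales an arbitrary set of estimates to zero sum and unit sum of squares; the function name `normalize` and the input values are hypothetical, chosen only to show the arithmetic.

```python
import math

def normalize(estimates):
    """Center estimates to zero sum, then scale to unit sum of squares."""
    mean = sum(estimates) / len(estimates)
    centered = [e - mean for e in estimates]
    scale = math.sqrt(sum(c * c for c in centered))
    return [c / scale for c in centered]

# Hypothetical raw estimates, for example least-squares marginal means.
raw = [1.0, 2.0, 3.0, 6.0]
utilities = normalize(raw)
print(utilities)
print(sum(utilities), sum(u * u for u in utilities))
```

Because centering removes a constant and scaling removes a multiplier, two sets of estimates that differ only in those respects normalize to the same utilities.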


Example 2 
Word Frequency 


The data set WORDS contains the most frequently used words in American English 
(Carroll et al., 1971). Three measures have been added to the data. The first is the (most 
likely) part of speech (PART$). The second is the number of letters (LETTERS) in the 
word. The third is a measure of the meaning (MEANING). This admittedly informal 
measure represents the amount of harm done to comprehension (1 = a little, 4 = a lot) 
by omitting the word from a sentence. While linguists may argue over these 
classifications, they do reveal basic differences. Instead of using a measure of 
frequency, we will work with the rank order itself to see if there is enough information 
to fit a model. This time, we will maximize Kendall's tau-b directly. 
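
For readers who want to see what is being optimized, the Python sketch below computes Kendall's tau-b and converts it to a loss of the form 1-(1+tau)/2, so that maximizing tau is the same as minimizing the loss. The function names are hypothetical and the pairwise loop is the textbook definition, not CONJOINT's implementation.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant, and tied pairs."""
    conc = disc = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied on both: excluded from both totals
            elif dx == 0:
                tied_x += 1       # tied on x only
            elif dy == 0:
                tied_y += 1       # tied on y only
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + tied_x) * (conc + disc + tied_y))
    return (conc - disc) / denom

def tau_loss(tau):
    """Map tau in [-1, 1] to a loss in [0, 1]; 0 is perfect agreement."""
    return 1.0 - (1.0 + tau) / 2.0

print(kendall_tau_b([1, 2, 3, 4], [10, 20, 30, 40]))
print(tau_loss(0.634850))
```

With a tau of 0.634850 the loss is 0.182575, which ties the goodness-of-fit value to the final loss in the iteration history for this example.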


The input is: 


CONJOINT 
USE WORDS 
LET RANK=CASE 
MODEL RANK = LETTERS PART$ MEANING 
ESTIMATE / LOSS=TAU, POLARITY=NEGATIVE 


The output is: 

Iterative Conjoint Analysis 
Monotonic Regression Model 
Data are ranks 

Loss Function is 1-(1+tau)/2 

Factors and Levels 

LETTERS   PART$         MEANING 
1         adjective     1 
2         adverb        2 
3         conjunction   3 
4         preposition 
          pronoun 
          verb 

Iteration History 

Convergence Criterion : 0.000010 
Maximum Iterations    : 50 

                         Max Parameter 
Iteration      Loss         Change 
    1        0.204218      0.095537 
    2        0.198807      0.091167 
    3        0.189789 
    4        0.186182 
    5        0.184379 
    6        0.182575 
    7        0.182525 
    8        0.182575      0.000000 

Parameter Estimates (Part Worth's) 

 LETTERS(1)   LETTERS(2)   LETTERS(3)   LETTERS(4)    adjective       adverb 
   0.153796     0.173729    -0.075540    -0.269941    -0.119144    -0.272801 

conjunction  preposition      pronoun         verb   MEANING(1)   MEANING(2) 
  -0.261956     0.214718     0.173243    -0.162163     0.748560    -0.120885 

 MEANING(3) 
  -0.181616 

Goodness of Fit (Kendall tau) 

   0.634850 

RMS Deleted Goodness of Fit Value, i.e. Fit when Parameter(i)=0 

 LETTERS(1)   LETTERS(2)   LETTERS(3)   LETTERS(4)    adjective       adverb 
   0.627636     0.609600     0.634850     0.605993     0.634850     0.616814 

conjunction  preposition      pronoun         verb   MEANING(1)   MEANING(2) 
   0.602386     0.613207     0.609600     0.631243     0.494173 

 MEANING(3) 
   0.609600 

[Figure: Shepard diagram and profile plots]


The Shepard diagram reveals a slightly curvilinear relationship between the data and 
the fitted values. We can parameterize that relationship by refitting the model as 
follows: 

ESTIMATE / REGRESSION=POWER, POLARITY=NEGATIVE 

SYSTAT will then print Computed Exponent: 1.391739. We will further examine this 
type of power function in the Box-Cox example. 
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The output tells us that, in general, shorter words are higher on the list, adverbs are 
lower, and prepositions are higher. Also, the most frequently occurring words are 
generally the most disposable. These statements must be made in the context of the 
model, however. To the extent that the separate statements are inaccurate when the data 
are examined separately for each, the additive model is violated. This is another way 
of saying that the additive model is appropriate when there are no interactions or 
configural effects. Incidentally, when these data are analyzed with GLM using the 
(inverse transformed) word frequencies themselves rather than rank order in the list, 
the conclusions are substantially the same. 


Example 3 
Box-Cox Model 


Box and Cox (1964) devised a maximum likelihood estimator for the exponent in the 
following model: 

$$E\{y^{(\lambda)}\} = X\beta$$

where $X$ is a matrix of known values, $\beta$ is a vector of unknown parameters associated 
with the transformed observations, and the residuals of the model are assumed to be 
normally distributed and independent. The transformation itself is assumed to take the 
following form: 

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases}$$
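
As a quick numerical check of the transformation, here is a minimal Python sketch. The function names are hypothetical. The second function divides by a power of the geometric mean of the data, which is the scaling that appears in the LOSS statement of the program below; treating that as the standard normalized Box-Cox form is an assumption of this sketch.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with exponent lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def box_cox_normalized(y, lam, gmean):
    """Transform scaled by gmean**(lam - 1) so losses are comparable across lam.

    Only the lam != 0 case is shown, as in the LOSS statement.
    """
    return (y ** lam - 1.0) / (lam * gmean ** (lam - 1.0))

data = [1.0, 2.0, 4.0]
gmean = math.exp(sum(math.log(v) for v in data) / len(data))  # geometric mean
print([box_cox(v, -1.0) for v in data])
print([box_cox_normalized(v, -1.0, gmean) for v in data])
```

As the exponent approaches zero, the power form approaches the log, which is why an estimated exponent near zero is read as evidence for a log transformation.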
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The following is a SYSTAT program (originally coded by Grant Blank) to compute the 
Box-Cox exponent and its standard error. The following comments document the 


program flow: 


GLM 
USE BOXCOX 
CATEGORY TREATMEN POISON 
MODEL Y=CONSTANT+TREATMEN+POISON 
SAVE TEMP / MODEL 
ESTIMATE 
USE TEMP 
SAVE GMEAN 
LET GMY = Y 
CSTATISTICS GMY / GMEAN 
MERGE GMEAN(GMY) TEMP(Y,X(1..5)) 
IF CASE=1 THEN LET GMEAN=GMY 
IF CASE>1 THEN LET GMEAN=LAG(GMEAN) 
NONLIN 
MODEL Y = B0 + B1*X(1) + B2*X(2) + B3*X(3) + B4*X(4) + B5*X(5) 
LOSS ((Y^POWER-1)/(POWER*GMEAN^(POWER-1))-ESTIMATE)^2 
ESTIMATE 


This program produces an estimate of -0.750 for lambda, with a 95% Wald confidence 
interval of (-1.181, -0.319). This is in agreement with the results in the original paper. 
Box and Cox recommend rounding the exponent to -1 because of its natural 
interpretation (rate of dying from poison). In general, it is wise to round such 
transformations to interpretable values such as ... -1, -0.5, 0, 0.5, 2 ... to facilitate the 
interpretation of results. 

The Box-Cox procedure is based on a specific model that assumes normality in the 
transformed data and that focuses on the dependent variable. We might ask whether it 
is worthwhile to examine transformations of this sort without assuming normality and 
resorting to maximum likelihood for our answer. This is especially appropriate if our 
general method is to find an “optimal” estimate of the exponent and then round it to 
the nearest interpretable value based on a confidence interval. Indeed, two discussants 
of the Box and Cox paper, John Hartigan and John Tukey, asked just that. 

The conjoint model offers one approach to this question. Specifically, we can use a 
power function relating the y data values to the predictor variables in our model and 
see how it converges. 


The input is: 


CONJOINT 
USE BOXCOX 
MODEL Y=POISON TREATMEN 
ESTIMATE / REGRESS=POWER 
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The output is: 


Iterative Conjoint Analysis 

Power Regression Model 
Data are dissimilarities 
Loss Function is Least Squares 

Factors and Levels 

POISON   TREATMEN 

Iteration History 

Convergence Criterion : 0.000010 
Maximum Iterations    : 50 

                         Max Parameter 
Iteration      Loss         Change 
    1        0.197779      0.102447 
    2        0.166190      0.053074 
    3        0.159477      0.147332 
    4        0.157122      0.097312 
    5        0.156227      0.015662 
    6        0.155991      0.019343 
    7        0.155928      0.014996 
    8        0.155917      0.002554 
    9        0.155914      0.004019 
   10        0.155913      0.000252 
   11        0.155913      0.000250 
   12        0.155913      0.000204 
   13        0.155913      0.000233 
   14        0.155913      0.000121 
   15        0.155913      0.000021 
   16        0.155913      0.000014 
   17        0.155913      0.000004 
   18        0.155913      0.000009 

Computed Exponent: -1.015078 

Parameter Estimates (Part Worth's) 

POISON(1)  POISON(2)  POISON(3)  TREATMEN(1)  TREATMEN(2)  TREATMEN(3) 

-0.264114 

Goodness of Fit (Pearson Correlation) 

-0.918742 

RMS Deleted Goodness of Fit Value, i.e. Fit when Parameter(i)=0 

  POISON(1)    POISON(2)    POISON(3)  TREATMEN(1)  TREATMEN(2)  TREATMEN(3)  TREATMEN(4) 
   0.871855     0.912197     0.784680     0.866249     0.867493     0.913554     0.898124 

[Figure: Shepard diagram and profile plots for POISON and TREATMEN]


On each iteration, CONJOINT transforms the observed (y) values by the current estimate 
of the exponent, regresses them on the currently weighted X variables (using the 
conjoint parameter estimates), and computes the loss from the residuals of that 
regression. Over iterations, this loss is minimized and we get to view the final fit in the 
plotted Shepard diagram. 
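
The idea behind this iteration can be imitated with a much cruder procedure. The Python sketch below is an assumption-laden stand-in, not CONJOINT's direct search: it tries a fixed grid of exponents, fits an additive model to a complete two-way table by row and column means, and keeps the exponent with the smallest residual share of variance. All names and the synthetic table are hypothetical.

```python
import math

def additive_fit(table):
    """Least-squares additive fit for a complete two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row = [sum(r) / n_cols for r in table]
    col = [sum(table[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    return [[row[i] + col[j] - grand for j in range(n_cols)] for i in range(n_rows)]

def power_loss(table, p):
    """Residual sum of squares over total sum of squares after transforming by p."""
    t = [[math.log(v) if p == 0 else v ** p for v in r] for r in table]
    fit = additive_fit(t)
    cells = [v for r in t for v in r]
    mean = sum(cells) / len(cells)
    ss_res = sum((t[i][j] - fit[i][j]) ** 2
                 for i in range(len(t)) for j in range(len(t[0])))
    ss_tot = sum((v - mean) ** 2 for v in cells)
    return ss_res / ss_tot

def best_exponent(table, candidates):
    """Candidate exponent giving the most nearly additive transformed table."""
    return min(candidates, key=lambda p: power_loss(table, p))

# Synthetic data: rates are additive, observed values are their reciprocals.
a, b = [1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 4.0]
table = [[1.0 / (ai + bj) for bj in b] for ai in a]
candidates = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
print(best_exponent(table, candidates))
```

Using a variance ratio as the loss keeps the comparison fair across exponents, because raw residuals shrink or grow with the scale of the transformed values.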

The CONJOINT program produced an estimate of -1.015 for the exponent. Draper 
and Hunter (1969) reanalyzed the poison data using several criteria suggested in the 
discussion to Box and Cox's paper and elsewhere (minimizing interaction F-ratio, 
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maximizing main-effects F-ratios, and minimizing Levene’s test for heterogeneity of 
within-group variances). They found the “best” exponent to be in the neighborhood 
of -1. 


Example 4 
Employment Discrimination 


The following table shows the mean salaries (SALNOW) of employees at a Chicago 
bank. The bank was involved in a discrimination lawsuit, and the focus of our interest 
is whether we can represent the salaries by a simple additive model. At the time these 
data were collected, there were no black females with a graduate school education 
working at the bank. The education variable records the highest level reached. 


                 High School    College    Grad School 
White Males            11735      16215          28251 
Black Males            11513      13341          20472 
White Females           9600      13612          11640 
Black Females           8874      10278 


Let us regress beginning salary (SALBEG) and current salary (SALNOW) on the gender/race 
and education data. To represent our model, we will code the categories with integers: 
for gender/race, 1=black females, 2=white females, 3=black males, 4=white males; for 
education, 1=high school, 2=college, 3=grad school. These codings order the salaries 
for both racial/gender status and educational levels. 


The input is: 


USE BANK 
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1 
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2 
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3 
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4 
LET EDUC=1 
IF EDLEVEL>12 THEN LET EDUC=2 
IF EDLEVEL>16 THEN LET EDUC=3 
LABEL GROUP / 1="Black Females",2="White Females", 
              3="Black Males",4="White Males" 
LABEL EDUC / 1="High School",2="College",3="Grad School" 
CONJOINT 
MODEL SALBEG,SALNOW=GROUP EDUC 
ESTIMATE / REGRESS=POWER 
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The output is: 

Iterative Conjoint Analysis 

Power Regression Model 

Data are dissimilarities 

Loss Function is Least Squares 

Factors and Levels 

GROUP           EDUC 

Black Females   High School 
White Females   College 
Black Males     Grad School 
White Males 

Iteration History 

Convergence Criterion : 0.000010 
Maximum Iterations    : 50 

                         Max Parameter 
Iteration      Loss         Change 
    1        0.393275      0.093113 
    2        0.373447      0.297339 
    3        0.363179      0.299693 
    4        0.360521      0.141548 
    5        0.358912      0.095797 
    6        0.358629      0.060458 
    7        0.358482      0.019857 
    8        0.358442      0.009334 
    9        0.358428      0.006259 
   10        0.358426      0.004379 
   11        0.358423      0.001950 
   12        0.358422      0.000838 
   13        0.358422      0.000239 
   14        0.358422      0.000551 
   15        0.358422      0.000181 
   16        0.358422      0.000078 
   17        0.358422      0.000051 
   18        0.358422      0.000078 
   19        0.358422      0.000018 
   20        0.358422      0.000004 
   21        0.358422      0.000002 
   22        0.358422      0.000001 
   23        0.358422      0.000001 
   24        0.358422      0.000000 

Computed Exponent: -0.075515 

Parameter Estimates (Part Worth's) 

 GROUP(1)   GROUP(2)   GROUP(3)   GROUP(4)    EDUC(1)    EDUC(2)    EDUC(3) 
-0.363485  -0.197707  -0.029682   0.150008  -0.364256  -0.014940   0.820062 

Goodness of Fit (Pearson Correlation) 

   SALBEG     SALNOW 
 0.814679   0.787054 

RMS Deleted Goodness of Fit Value, i.e. Fit when Parameter(i)=0 

 GROUP(1)   GROUP(2)   GROUP(3)   GROUP(4)    EDUC(1)    EDUC(2)    EDUC(3) 
 0.782223   0.785950   0.800797   0.794471   0.751001   0.800896   0.697731 

[Figure: Shepard diagrams for SALBEG and SALNOW]

The computed exponent (-0.075515) is close to zero, which suggests that a log 
transformation would be appropriate for fitting a parametric model. The two salary 
measurements (salary at time of hire and at time of the study) perform similarly, 
although beginning salary shows a slightly better fit to the additive model (0.814679 
versus 0.787054). You can see the difference in the two printed Shepard diagrams. 
The estimates of the parameters show a clear ordering in the categories. 

Check for sensitivity of the parameter estimates by examining the root-mean-square 
deleted goodness of fit values. The reported values are averages of the fits for both 
SALBEG and SALNOW when the respective parameter is set to zero. Here we find that 
the greatest change in goodness of fit corresponds to a change in the Grad School 
parameter. 
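
The deleted goodness-of-fit idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the CONJOINT computation: the names are hypothetical, the fit measure is the Pearson correlation between each response and the joint scores, and the fits for several responses are combined as a root mean square.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def joint_scores(cases, worths):
    """Sum the part worths of the levels that describe each case."""
    return [sum(worths[level] for level in case) for case in cases]

def deleted_fit(cases, worths, responses, name):
    """RMS over responses of the fit obtained with one part worth set to zero."""
    trimmed = dict(worths)
    trimmed[name] = 0.0
    scores = joint_scores(cases, trimmed)
    fits = [pearson(resp, scores) for resp in responses]
    return math.sqrt(sum(f * f for f in fits) / len(fits))

# Hypothetical two-factor design with one response that the model fits exactly.
cases = [("g1", "e1"), ("g1", "e2"), ("g2", "e1"), ("g2", "e2")]
worths = {"g1": -1.0, "g2": 1.0, "e1": -0.5, "e2": 0.5}
responses = [[-1.5, -0.5, 0.5, 1.5]]
for name in worths:
    print(name, round(deleted_fit(cases, worths, responses, name), 4))
```

A parameter whose deletion lowers the fit the most is the one the solution depends on most heavily, which is how the Grad School parameter is singled out above.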


Transformed Additive Model 


The transformed additive model removes the highly significant interaction for 
SALNOW and almost removes it for SALBEG in these data. You can see this by 
recoding the education and gender/race variables with the parameter estimates from the 
conjoint analysis. 


The input is: 


IF GROUP=1 THEN LET G=-.365 
IF GROUP=2 THEN LET G=-.2 
IF GROUP=3 THEN LET G=-.033 
IF GROUP=4 THEN LET G=.147 
IF EDUC=1 THEN LET E=-.359 
IF EDUC=2 THEN LET E=-.011 
IF EDUC=3 THEN LET E=.822 
LET LSALB=LOG(SALBEG) 
LET LSALN=LOG(SALNOW) 
GLM 
MODEL LSALB,LSALN = CONSTANT+E+G+E*G 
ESTIMATE 
HYPOTHESIS 
EFFECT=E*G 
TEST 


The output is: 
N of Cases Processed : 474 
Dependent Variable Means 
LSALB LSALN 


8.753114 9.440502 


Regression Coefficients B = (X'X)^-1 X'Y 

Factor    |     LSALB       LSALN 
----------+---------------------- 
CONSTANT  |              9.530658 
E         |              0.653259 
G         |              0.721529 
E*G       |  0.557859    0.351204 
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Multiple Correlations 
LSALB LSALN 
"0.816510 0.788739 
Adjusted R = 1-(1-R? )*(N-1)/df, where N = 474, and df = 470 
Adjusted R 
(0.664561 0.619697 
*** WARNING *** : 
Case 297 has large Leverage (Leverage : 0.127648 


Test for effect called: E*G 


Univariate F Tests

Source |  Type III SS   df  Mean Squares   F-ratio   p-value
-------+-----------------------------------------------------
LSALB  |     0.275440    1      0.275440  6.595624  0.010531
Error  |    19.627666  470      0.041761
LSALN  |     0.109169    1      0.109169  1.818261  0.178170
Error  |    28.218839  470      0.060040


Multivariate Test Statistics 


Statistic              |    Value   F-ratio      df   p-value
-----------------------+--------------------------------------
Wilks's Lambda         | 0.985513  3.447204  2, 469  0.032643
Pillai Trace           | 0.014487  3.447204  2, 469  0.032643
Hotelling-Lawley Trace | 0.014700  3.447204  2, 469  0.032643
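As a quick arithmetic check on the listing above, the adjusted squared multiple correlations can be reproduced from the printed multiple correlations with the formula given in the output. The short Python sketch below is only an illustration: Python is not part of SYSTAT, the function name is hypothetical, and it assumes that the printed Multiple Correlations are R rather than R².

```python
def adjusted_r_squared(r, n, df):
    """Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / df, as stated in the output."""
    return 1.0 - (1.0 - r * r) * (n - 1) / df

# Multiple correlations for LSALB and LSALN taken from the listing (N = 474, df = 470).
adj_lsalb = adjusted_r_squared(0.816510, 474, 470)
adj_lsaln = adjusted_r_squared(0.788739, 474, 470)
print(f"{adj_lsalb:.6f} {adj_lsaln:.6f}")
```

Under that assumption the two results agree with the printed adjusted values to within rounding of the inputs.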


Ordered Scatterplots 


Finally, let us use SYSTAT to produce scatterplots of beginning and current salary
ordered by the conjoint coefficients. The SYSTAT code to do this can be found in the
file CONJO4.SYC. The spacing of the scatterplots should tell the story.
The CONJO4.SYC command script is:


USE BANK
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4
LET EDUC=1
IF EDLEVEL>12 THEN LET EDUC=2
IF EDLEVEL>16 THEN LET EDUC=3
LABEL GROUP / 1="Black Females",2="White Females",
              3="Black Males",4="White Males"
LABEL EDUC / 1="High School",2="College",3="Grad School"
ORDER GROUP / LABEL SORT = "White Males", "Black Males",
              "White Females", "Black Females"
PLOT SALNOW * SALBEG / GROUP= GROUP, EDUC multiplot
The story is mainly in this graph: minorities and women received lower salaries,
regardless of educational level. There are a few exceptions to the general pattern, but 
overall the bank had reason to settle the lawsuit. 


Computation 


Algorithms 


CONJOINT uses a direct search optimization method to minimize the loss function. This
enables minimization of Kendall’s tau. There is no guarantee that the program will find
the global minimum of tau, so it is wise to try several regression types and the STRESS
loss to be sure that they all reach approximately the same neighbourhood.


Missing Data 


Missing values are processed by omitting them from the loss function. 
Correlations, Associations, and
Distance Measures


Leland Wilkinson, Laszlo Engelman, and Rick Marcantoni 
(revised by Harshaprabha Shetty) 


Correlations (CORR) computes correlations and measures of similarity and distance. 
It prints the resulting matrix and, if requested, saves it in a SYSTAT file for further
analysis, such as multidimensional scaling, cluster, or factor analysis. 

For continuous data, Correlations provides the Pearson correlation, covariances, 
and sum of squares of deviations from the mean and sum of cross-products of 
deviations (SSCP). In addition to the usual probabilities, the Bonferroni and Dunn- 
Sidak adjustments are available with Pearson correlations. If distances are desired, 
Euclidean or city-block distances are available. Similarity measures for continuous 
data include the Bray-Curtis coefficient and the QSK quantitative symmetric 
coefficient (or Kulczynski measure). 

For rank-order data, Correlations provides Goodman-Kruskal’s gamma (see 
Goodman and Kruskal, 1954), Guttman’s mu2, Spearman’s rho, Kendall’s tau b, and 
Stuart’s tau c. 

For unordered data, Correlations provides Phi coefficient, Cramer’s V, 
Contingency coefficient, Goodman-Kruskal’s Lambda (symmetric measure) and 
Uncertainty coefficient (symmetric measure). 

For binary data, Correlations provides S2, the positive matching dichotomy
coefficient; S3, Jaccard's dichotomy coefficient; S4, the simple matching dichotomy
coefficient; S5, Anderberg's dichotomy coefficient; S6, Tanimoto's dichotomy
coefficient; S7, Anderberg's binary similarity coefficient; Yule's Q coefficient;
Hamman's binary similarity coefficient; Dice's binary similarity coefficient; Sneath
and Sokal's binary similarity coefficient; Ochiai's binary similarity coefficient;
Kulczynski's binary similarity coefficient; and Gower2 binary similarity coefficient.

When data are missing, listwise and pairwise deletion methods are available for all 
measures. An EM algorithm is an option for maximum likelihood estimates of 
correlation, covariance, and cross-products of deviations matrices. For robust ML
estimates where outliers are downweighted, the user can specify the degrees of
freedom for the t distribution or contamination for a normal distribution. Correlations
includes a graphical display of the pattern of missing values. Little’s MCAR test is 
printed with the display. The EM algorithm also identifies cases with extreme 
Mahalanobis distances. 

Hadi’s robust outlier detection and estimation procedure is an option for 
correlations, covariances, and SSCP; cases identified as outliers by the procedure are 
not used to compute estimates. 

Resampling procedures are available in this feature. SYSTAT gives a 
summarization based on resampling for Pearson, Spearman, Gamma, Tau b, and MU2 
correlation coefficients. You can get resampling-based estimates along with their bias 
and standard error. Under bootstrap, you will also get a confidence interval for the
parameter concerned using two popular methods: the percentile method and the
bias-corrected and accelerated (BCa) method.


Statistical Background 


SYSTAT computes many different measures of the strength of association between 
variables. The most popular measure is the Pearson correlation, which is appropriate 
for describing linear relationships between continuous variables. However, CORR 
offers a variety of alternative measures of similarity and distance appropriate if the data 
are not continuous. 

Let us look at an example. The following data, from the CARS file, are taken from 
various issues of Car and Driver and Road & Track magazine. They are the car 
enthusiasts’ equivalent of Consumer Reports performance ratings. The cars rated
include some of the most expensive and exotic cars in the world (for example, Ferrari
Testarossa) as well as some of the least expensive but sporty cars (for example, Honda
Civic CRX). The attributes measured are 0-60 m.p.h. acceleration, braking distance in
feet from 60-0 m.p.h., slalom times (speed over a twisty course), miles per gallon, and
top speed in miles per hour.


ACCEL   BRAKE   SLALOM    MPG   SPEED   NAME$
  5.0     245     61.3   17.0     153   Porsche 911T
  5.3     242     61.9   12.0     181   Testarossa
  5.8     243     62.6   19.0     154   Corvette
  7.0     267     57.8   14.5     145   Mercedes 560
  7.6     271     59.8   21.0     124   Saab 9000
  7.9     259     61.7   19.0     130   Toyota Supra
  8.5     263     59.9   17.5     131   BMW 635
  8.7     287     64.2   35.0     115   Civic CRX
  9.3     258     64.1   24.5     129   Acura Legend
 10.8     287     60.8   25.0     100   VW Fox GL
 13.0     253     62.3   27.0      95   Chevy Nova


The Scatterplot Matrix (SPLOM)


A convenient summary that shows the relationships between the performance variables
is to arrange them in a matrix. A matrix is a rectangular array. We can put any sort of
numbers in the cells of the matrix, but we will focus on measures of association. Before
doing that, however, let us examine a graphical matrix, the scatterplot matrix
(SPLOM).


[Figure: Scatter Plot Matrix]


This matrix shows the histogram of each variable on the diagonal and the scatterplots
(x-y plots) of each variable against the others. For example, the scatterplot of
acceleration versus braking is at the top of the matrix. Since the matrix is symmetric,
only the bottom half is shown. In other words, the plot of acceleration versus braking
is the same as the transposed scatterplot of braking versus acceleration.


The Pearson Correlation Coefficient 


Now, assume that we want a single number that summarizes how well we could predict
acceleration from braking using a straight line. For linear regression, we discuss how
we calculate such a line, but it is enough here to know that we are interested in drawing
a line through the area covered by the points in the scatterplot such that, on the average,
the acceleration of a car could be predicted rather well by the value on the line
corresponding to its braking. The closer the points cluster around this line, the better
would be the prediction.

In addition, we want this number to represent simultaneously how well we can
predict braking from acceleration using a similar line. This symmetry we seek is
fundamental to all the measures available in CORR. It means that, whatever the scales
on which we measure our variables, the coefficient of association we compute will be
the same for either prediction. If this symmetry makes no sense for a certain data set,
then you probably should not be using CORR.

The most common measure of association is the Pearson correlation coefficient, 
which varies between —1 and +1. A Pearson correlation of 0 indicates that neither of 
two variables can be predicted from the other by using a linear equation. A Pearson 
correlation of +1 indicates that one variable can be predicted perfectly by a positive 
linear function of the other, and vice versa. And a value of -1 indicates the same, 
except that the function has a negative sign for the slope of the line. 


Following is the Pearson correlation matrix corresponding to this SPLOM: 


Pearson Correlation Matrix 


           ACCEL   BRAKE  SLALOM     MPG   SPEED
MPG        0.651   0.622   0.597   1.000
SPEED     -0.908  -0.665  -0.115  -0.768   1.000


Try superimposing in your mind the correlation matrix on the SPLOM. The Pearson 
correlation for acceleration versus braking is 0.466. This correlation is positive and 
moderate in size. On the other hand, the correlation between acceleration and speed is
negative and quite large (-0.908). You can see in the lower left corner of the SPLOM
that the points cluster around a downward sloping line. In fact, all of the correlations 
of speed with the other variables are negative, which makes sense since greater speed 
implies greater performance. The same is true for slalom performance, but this is 
clouded by the fact that some small but slower cars like the Honda Civic CRX are 
extremely agile. 
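The coefficients quoted above can be recomputed directly from the data listing. The following Python sketch is an illustration only (SYSTAT's CORR does this internally, and the function name `pearson` is hypothetical); it uses the textbook product-moment formula on the ACCEL, BRAKE, and SPEED columns.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Columns from the CARS listing, in the order the cars are printed.
accel = [5.0, 5.3, 5.8, 7.0, 7.6, 7.9, 8.5, 8.7, 9.3, 10.8, 13.0]
brake = [245, 242, 243, 267, 271, 259, 263, 287, 258, 287, 253]
speed = [153, 181, 154, 145, 124, 130, 131, 115, 129, 100, 95]

r_accel_brake = pearson(accel, brake)   # moderate, positive
r_accel_speed = pearson(accel, speed)   # large, negative
print(f"{r_accel_brake:.3f} {r_accel_speed:.3f}")
```

Because the formula is symmetric in `x` and `y`, swapping the arguments gives the same value, which is the symmetry discussed earlier in this section.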

Keep in mind that the Pearson correlation measures linear predictability. Do not 
assume that a Pearson correlation near 0 implies no relationship between variables. 
Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson
correlations of 0.
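A small made-up example makes the warning concrete. In the Python sketch below (illustrative only; the data and the `pearson` helper are not from SYSTAT), y is an exact quadratic function of x, so y is perfectly predictable from x, yet the Pearson correlation is zero because the relationship has no linear component.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]        # perfect U-shaped dependence of y on x
r_u = pearson(x, y)
print(r_u)
```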


Other Measures of Association 


CORR offers a variety of other association measures. There is not room here to discuss 
all of them, but let us review some briefly. 
Dissimilarity and Distance Measures


These measures include the Bray-Curtis (BC) dissimilarity measure, the quantitative 
symmetric dissimilarity coefficient, the Euclidean distance, and the city-block 
distance. 

Euclidean and city-block distance measures have been widely available in software
packages for many years; Bray-Curtis and QSK are less common. For each pair of
variables,

$$ BC_{ij} = \frac{\sum_k |x_{ik} - x_{jk}|}{\sum_k x_{ik} + \sum_k x_{jk}} $$

$$ QSK_{ij} = 1 - \frac{1}{2}\left[ \frac{\sum_k \min(x_{ik}, x_{jk})}{\sum_k x_{ik}} + \frac{\sum_k \min(x_{ik}, x_{jk})}{\sum_k x_{jk}} \right] $$

where i and j are variables and k is cases. After an extensive computer simulation study,
Faith, Minchin, and Belbin (1987) concluded that BC and QSK were "effective as 
robust measures" in terms of both rank and linear correlation. The use of these 
measures is similar to that for Correlations (Pearson, Covariance, and SSCP), except 
that the EM, Prob, Bonferroni, Dunn-Sidak, and Hadi options are not available. 
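To make the two less common measures concrete, here is a minimal Python sketch. It assumes the usual definitions from the ecological literature (Bray-Curtis as the sum of absolute differences over the sum of both totals; QSK as one minus the average of the two shared proportions) and nonnegative data; the function names are hypothetical, and SYSTAT's own implementation may differ in details such as normalization.

```python
def bray_curtis(xi, xj):
    """Bray-Curtis dissimilarity: sum_k |x_ik - x_jk| / (sum_k x_ik + sum_k x_jk)."""
    numerator = sum(abs(a - b) for a, b in zip(xi, xj))
    return numerator / (sum(xi) + sum(xj))

def qsk(xi, xj):
    """Quantitative symmetric (Kulczynski) dissimilarity for nonnegative data."""
    shared = sum(min(a, b) for a, b in zip(xi, xj))
    return 1.0 - 0.5 * (shared / sum(xi) + shared / sum(xj))

# Two made-up variables measured on four cases.
u = [1, 2, 3, 4]
v = [2, 2, 1, 5]
bc_uv = bray_curtis(u, v)
qsk_uv = qsk(u, v)
print(bc_uv, qsk_uv)
```

In this example the two variables have the same total, in which case the two measures coincide; with unequal totals they generally differ.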


Measures for Rank-Order Data 


Several measures are available for rank-order data: Goodman-Kruskal's gamma,
Guttman's mu2, Spearman's rho, Kendall's tau b, and Stuart's tau c. Each one measures
some aspect of rank-order association. The one closest to Pearson is the Spearman.
Spearman's rho is simply a Pearson correlation computed on the same data after
converting them to ranks. Goodman-Kruskal's gamma and Kendall's tau b reflect the
tendency for two cases to have similar orderings on two variables. However, the former
focuses on cases which are not tied in rank orderings. If no ties exist, these two
measures will be equal. For tied cases, we can use Stuart’s tau c. Kendall's tau b cannot
attain +1 even in case of complete association, unless the number of levels of the two
factors is the same. Tau b and tau c take the same value if the marginal frequencies are
the same (Kendall and Stuart, 1979).
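The relationship between gamma and tau b can be seen by counting concordant and discordant pairs. The Python sketch below is a plain illustration using the standard pair-counting definitions (the function name and the small data sets are made up); it is not SYSTAT's algorithm.

```python
import math
from itertools import combinations

def gamma_and_tau_b(x, y):
    """Goodman-Kruskal gamma and Kendall tau b from pairwise comparisons."""
    conc = disc = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            conc += 1                     # same ordering on both variables
        elif s < 0:
            disc += 1                     # opposite ordering
        elif x1 == x2 and y1 != y2:
            tied_x_only += 1
        elif y1 == y2 and x1 != x2:
            tied_y_only += 1              # pairs tied on both are ignored
    gamma = (conc - disc) / (conc + disc)
    tau_b = (conc - disc) / math.sqrt(
        (conc + disc + tied_x_only) * (conc + disc + tied_y_only))
    return gamma, tau_b

# No ties: the two measures agree.
g1, t1 = gamma_and_tau_b([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
# Ties present: gamma ignores tied pairs, tau b is pulled toward zero.
g2, t2 = gamma_and_tau_b([1, 1, 2, 2, 3], [1, 2, 2, 3, 3])
print(g1, t1, g2, t2)
```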


The following is the matrix computed for Spearman's rho: 
Spearman Correlation Matrix 


           ACCEL   BRAKE  SLALOM     MPG   SPEED
SLALOM     0.245  -0.305   1.000
MPG        0.815   0.502   0.487   1.000
SPEED     -0.891  -0.651  -0.109  -0.884   1.000


It is often useful to compute both a Spearman and Pearson matrix on the same data.
The absolute difference between the two can reveal unusual features. For example, the 
greatest difference for our data is on the slalom-braking correlation. This is because the 
Honda Civic CRX is so fast through the slalom, despite its inferior brakes, that it 
attenuates the Pearson correlation between slalom and braking. The Spearman 
correlation reduces its influence. 
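The comparison can be sketched in a few lines of Python. This is an illustration, not SYSTAT output: the helper names are hypothetical, tied values receive average ranks, and the BMW 635 slalom value is taken as 59.9, the reading that is consistent with the correlations printed in this chapter.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1             # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranked data."""
    return pearson(average_ranks(x), average_ranks(y))

brake = [245, 242, 243, 267, 271, 259, 263, 287, 258, 287, 253]
slalom = [61.3, 61.9, 62.6, 57.8, 59.8, 61.7, 59.9, 64.2, 64.1, 60.8, 62.3]

r_bs = pearson(brake, slalom)
rho_bs = spearman(brake, slalom)
print(f"Pearson {r_bs:.3f}  Spearman {rho_bs:.3f}")
```

On these data the Pearson value is closer to zero than the Spearman value, which is the attenuation described above.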


Measures for Unordered Data 


Several measures are available for unordered data: Phi coefficient, Contingency 
coefficient, Cramer’s V, Goodman-Kruskal lambda and Uncertainty coefficient. Phi, 
Contingency coefficient and Cramer’s V are based on the Pearson chi-square statistic. 
The Phi coefficient is the most basic one. If the observations from a bivariate normal 
distribution with population correlation ρ are classified into a contingency table, then
the square of the contingency coefficient tends to be ρ², as the number of categories
increases. The contingency coefficient lies in the interval [0,1]; generally, it cannot 
attain the upper limit. The advantage of Cramer's V over the contingency coefficient 
is that it attains its maximum value, i.e. 1 for complete association. Goodman- 
Kruskal's lambda finds the information given by one variable based on the information 
given by the other one. Lambda explains the percentage reduction in error while 
detecting the other variable. The Uncertainty coefficient concentrates on the amount of 
uncertainty of one variable explained by the other one. 
For example, we have classified 40 countries according to their government
(Democracy, One Party, and Military) and their leaders (Islamic, Catholic, Marxist,
and Protestant). Let the contingency table be as follows:

             Catholic   Islamic   Marxist   Protestant
Democracy          17         0         0           11
Military            0         7         0            0
One Party           0         0         5            0

We calculate the phi coefficient, contingency coefficient, and Cramer’s V of the above
table as follows:

Phi coefficient            1.414
Contingency coefficient    0.816
Cramer’s V                 1.000

From these values, there is a clear indication of ‘complete’ association between the
government and its leaders. Cramer’s V attains 1 for complete association whereas the
contingency coefficient is 0.816.
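The three chi-square-based measures can be checked by hand for this table. The Python sketch below uses the standard textbook formulas (phi = sqrt(X²/N), C = sqrt(X²/(X² + N)), V = sqrt(X²/(N·min(r-1, c-1)))); the function name is hypothetical and the code is offered as an illustration rather than as SYSTAT's implementation.

```python
import math

def chi_square_measures(table):
    """Return (phi, contingency coefficient, Cramer's V) for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    k = min(len(table), len(table[0])) - 1
    cramer_v = math.sqrt(chi2 / (n * k))
    return phi, contingency, cramer_v

# Rows: Democracy, Military, One Party; columns: Catholic, Islamic, Marxist, Protestant.
government_by_leader = [
    [17, 0, 0, 11],
    [0, 7, 0, 0],
    [0, 0, 5, 0],
]
phi, contingency, cramer_v = chi_square_measures(government_by_leader)
print(f"{phi:.3f} {contingency:.3f} {cramer_v:.3f}")
```

Note that under these definitions phi is not bounded by 1 for tables larger than 2 x 2; here the Pearson chi-square is 80 on N = 40, so phi is the square root of 2.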
Measures for Binary Data 


CORR offers the following association measures for binary data: positive matching 
dichotomy coefficient (S2), Jaccard’s dichotomy coefficient (S3), simple matching 
dichotomy coefficient (S4), Anderberg’s dichotomy coefficient (S5), Tanimoto’s
dichotomy coefficient (S6), Anderberg’s binary similarity coefficient (S7), Yule’s Q
coefficient, Hamman's binary similarity coefficient, Dice's binary similarity
coefficient, Sneath and Sokal's binary similarity coefficient, Ochiai's binary similarity
coefficient, Kulczynski's binary similarity coefficient, Gower2 binary similarity
coefficient, and tetrachoric correlations.


Dichotomy coefficients. These coefficients relate variables whose values may 
represent the presence or absence of an attribute or simply two values. They are 
documented in Gower (1985). These coefficients were chosen for SYSTAT because
they are metric and produce symmetric positive semidefinite (Gramian) matrices,
provided that you do not use the pairwise deletion option. This makes them suitable for
multidimensional scaling and factoring as well as clustering. The following table
shows how the similarity coefficients are computed:


                  x_j
               1       0
   x_i    1    a       b      a+b
          0    c       d      c+d
              a+c     b+d

S2 = a / (a+b+c+d)          Proportion of pairs with both values present
S3 = a / (a+b+c)            Proportion of pairs with both values present
                            given that at least one occurs
S4 = (a+d) / (a+b+c+d)      Proportion of pairs where the values of both
                            variables agree
S5 = a / (a+2(b+c))         S3 standardized by all possible patterns of
                            agreement and disagreement
S6 = (a+d) / (a+2(b+c)+d)   S4 standardized by all possible patterns of
                            agreement and disagreement

When the absence of an attribute in both variables is deemed to convey no information,
d should not be included in the coefficient (see S3 and S5).
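The five dichotomy coefficients are simple functions of the four cell counts. The Python sketch below is an illustration with made-up data and a hypothetical function name; it assumes 0/1 coding, with a counting cases where both variables are 1, b and c the two kinds of mismatch, and d cases where both are 0.

```python
def dichotomy_coefficients(x, y):
    """S2-S6 for two 0/1 variables, computed from the 2 x 2 cell counts a, b, c, d."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    n = a + b + c + d
    return {
        "S2": a / n,                                 # positive matching
        "S3": a / (a + b + c),                       # Jaccard
        "S4": (a + d) / n,                           # simple matching
        "S5": a / (a + 2 * (b + c)),                 # Anderberg
        "S6": (a + d) / (a + 2 * (b + c) + d),       # Tanimoto
    }

# Made-up presence/absence data on eight cases: a = 3, b = 1, c = 1, d = 3.
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]
coefficients = dichotomy_coefficients(x, y)
print(coefficients)
```

S3 and S5 leave d out of the formula, so adding more cases where both attributes are absent changes S2, S4, and S6 but not S3 or S5.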


Yule's Q coefficient. Yule defined a measure Q, which becomes 0 when the variables
are independent, becomes +1 when there is complete positive association, and becomes
-1 when there is complete negative association. The coefficient is given by:

$$ Q = \frac{ad - bc}{ad + bc} $$


The other binary similarity coefficients are:

Hamman’s binary similarity coefficient

$$ \text{Hamman} = \frac{(a + d) - (b + c)}{a + b + c + d} $$

Dice’s binary similarity coefficient

$$ \text{Dice} = \frac{2a}{2a + b + c} $$

Anderberg’s binary similarity coefficient

$$ S7 = \frac{a}{a + 2(b + c)} $$


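Sneath and Sokal’s binary similarity coefficient

$$ \text{Sneath} = \frac{2(a + d)}{2(a + d) + (b + c)} $$

Ochiai’s binary similarity coefficient

$$ \text{Ochiai} = \frac{a}{\sqrt{(a + b)(a + c)}} $$

Kulczynski's binary similarity coefficient

$$ \text{Kulczynski} = \frac{1}{2}\left[ \frac{a}{a + b} + \frac{a}{a + c} \right] $$

Gower2's binary similarity coefficient

$$ \text{Gower2} = \frac{ad}{\sqrt{(a + b)(a + c)(d + b)(d + c)}} $$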
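For readers who want to experiment, the following Python sketch evaluates several of these coefficients from the four cell counts. It assumes the commonly cited forms of each coefficient and a hypothetical function name; it is not SYSTAT's code, and it omits S7.

```python
import math

def binary_similarities(a, b, c, d):
    """Similarity coefficients for a 2 x 2 table with cells a, b, c, d."""
    return {
        "yule_q": (a * d - b * c) / (a * d + b * c),
        "hamman": ((a + d) - (b + c)) / (a + b + c + d),
        "dice": 2 * a / (2 * a + b + c),
        "sneath_sokal": 2 * (a + d) / (2 * (a + d) + (b + c)),
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "kulczynski": 0.5 * (a / (a + b) + a / (a + c)),
        "gower2": a * d / math.sqrt((a + b) * (a + c) * (d + b) * (d + c)),
    }

# Made-up counts: 3 joint presences, 3 joint absences, 1 mismatch of each kind.
sims = binary_similarities(3, 1, 1, 3)
for name, value in sims.items():
    print(f"{name:13s} {value:.4f}")
```

Swapping the roles of agreement and disagreement (for example, counts 1, 3, 3, 1) turns Yule's Q from 0.8 into -0.8, which illustrates the sign convention of the coefficient.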


Tetrachoric correlation. While the data for this measure are binary, they are assumed
to be a random sample from a bivariate normal distribution. For example, let us draw
a horizontal line and a vertical line on this bivariate normal distribution and count the
number of observations in each quadrant.


A large proportion of the observations fall in the upper right and lower left quadrants
because the relationship is positive (the Pearson correlation is approximately 0.70). 
Correspondingly, if there were a strong negative relationship, the points would 
concentrate in the upper left and lower right quadrants. If the original observations are 
no longer available but you do have the frequency counts for the four quadrants, try a 
tetrachoric correlation. 

The computations for the tetrachoric correlation begin by finding estimates of the 
inverse cumulative marginal distributions: 


z value for x = Ф! {Кз and z value for y; = ©"! (ee) 
45 45 
and using these values as limits when integrating the bivariate normal density 
expressed in terms of r, the correlation, and then solving for r. 
If you have the original data, do not bother dichotomizing them because the
tetrachoric correlation has an efficiency of 0.40 compared with the efficient Pearson
correlation estimate.
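The computation described above can be sketched with the Python standard library. This is one possible numerical approach under stated assumptions, not SYSTAT's algorithm: the thresholds come from the inverse normal of the marginal proportions, the quadrant probability is obtained from the known identity that its derivative with respect to r equals the bivariate normal density (integrated here by Simpson's rule), and r is found by bisection. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = 1.0 - r * r
    return math.exp(-(h * h - 2 * r * h * k + k * k) / (2 * q)) / (2 * math.pi * math.sqrt(q))

def upper_quadrant_prob(h, k, r, steps=200):
    """P(X > h, Y > k): the value at r = 0 plus the density integrated over [0, r]."""
    nd = NormalDist()
    base = (1 - nd.cdf(h)) * (1 - nd.cdf(k))
    if r == 0:
        return base
    w = r / steps                                    # signed step; steps is even
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, r)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * w)
    return base + total * w / 3

def tetrachoric(a, b, c, d, tol=1e-6):
    """Tetrachoric r for cells a (both 1), b (x=1, y=0), c (x=0, y=1), d (both 0)."""
    n = a + b + c + d
    nd = NormalDist()
    h = nd.inv_cdf(1 - (a + b) / n)                  # threshold with P(X > h) = P(x = 1)
    k = nd.inv_cdf(1 - (a + c) / n)                  # threshold with P(Y > k) = P(y = 1)
    target = a / n
    lo, hi = -0.999, 0.999
    while hi - lo > tol:                             # quadrant probability increases with r
        mid = (lo + hi) / 2
        if upper_quadrant_prob(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

r_tet = tetrachoric(200, 100, 100, 200)
print(round(r_tet, 3))
```

For this made-up table both thresholds are at zero, where the quadrant probability has the closed form 1/4 + arcsin(r)/(2π); a joint proportion of one third therefore corresponds to r = 0.5, which gives a convenient check on the numerical routine.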


Transposed Data 


You can use CORR to compute measures of association on the rows or columns of your
data. Simply transpose the data and then use CORR. This makes sense when you want
to assess similarity between rows. We might be interested in identifying similar cars
from our performance measures, for example. Recall that you cannot transpose a file
that contains character data.

When you compute association measures across rows, however, be sure that the 
variables are on comparable scales. Otherwise, a single variable will influence most of 
the association. With the cars data, braking and speed are so large that they would 
almost uniquely determine the similarity between cars. Consequently, we standardized 
the data before transposing them. That way, the correlations measure the similarities 
comparably across attributes. 
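The standardize-then-transpose idea can be sketched in Python as follows. This is an illustration with hypothetical names, not the SYSTAT procedure itself: each variable is converted to z scores using the sample standard deviation, and the Pearson correlation is then computed between the standardized five-value profiles of two cars. The BMW 635 slalom value is taken as 59.9, the reading consistent with the correlations printed in this chapter.

```python
import math
from statistics import mean, stdev

CARS = [  # name, ACCEL, BRAKE, SLALOM, MPG, SPEED
    ("Porsche 911T", 5.0, 245, 61.3, 17.0, 153),
    ("Testarossa", 5.3, 242, 61.9, 12.0, 181),
    ("Corvette", 5.8, 243, 62.6, 19.0, 154),
    ("Mercedes 560", 7.0, 267, 57.8, 14.5, 145),
    ("Saab 9000", 7.6, 271, 59.8, 21.0, 124),
    ("Toyota Supra", 7.9, 259, 61.7, 19.0, 130),
    ("BMW 635", 8.5, 263, 59.9, 17.5, 131),
    ("Civic CRX", 8.7, 287, 64.2, 35.0, 115),
    ("Acura Legend", 9.3, 258, 64.1, 24.5, 129),
    ("VW Fox GL", 10.8, 287, 60.8, 25.0, 100),
    ("Chevy Nova", 13.0, 253, 62.3, 27.0, 95),
]

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(column):
    """Convert one variable to z scores (mean 0, sample standard deviation 1)."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

names = [row[0] for row in CARS]
columns = list(zip(*[row[1:] for row in CARS]))      # one tuple per variable
z_columns = [standardize(col) for col in columns]
z_rows = dict(zip(names, zip(*z_columns)))           # transposed: one profile per car

def car_similarity(car_a, car_b):
    """Correlation between two cars across the five standardized attributes."""
    return pearson(z_rows[car_a], z_rows[car_b])

print(round(car_similarity("Porsche 911T", "Testarossa"), 3))
```

Without the standardization step, the BRAKE and SPEED columns (values in the hundreds) would dominate every row profile and most cars would appear almost perfectly correlated.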

Following is the Pearson correlation matrix for our cars: 


Number of Observations: 5 


Pearson Correlation Matrix 


           | Porsche  Testarossa  Corvette  Mercedes    Saab  Toyota     BMW   Honda
-----------+------------------------------------------------------------------------
Porsche    |   1.000
Testarossa |   0.940       1.000
Corvette   |   0.938       0.868     1.000
Mercedes   |   0.093       0.212    -0.240     1.000
Saab       |  -0.506      -0.523    -0.760     0.664   1.000
Toyota     |   0.238       0.429     0.402    -0.379  -0.681   1.000
BMW        |  -0.319      -0.095    -0.557     0.854   0.634  -0.247
Honda      |  -0.504      -0.730    -0.392    -0.519   0.265  -0.298
Acura      |  -0.046      -0.102     0.298    -0.978  -0.770   0.533
VW         |  -0.962      -0.928    -0.980     0.079   0.704  -0.353
Chevy      |  -0.731      -0.698    -0.491    -0.532  -0.131  -0.033


Pearson Correlation Matrix (contd...)

           |   Acura      VW   Chevy
-----------+-------------------------
Acura      |   1.000
VW         |  -0.156   1.000
Chevy      |   0.536   0.525   1.000


Hadi Robust Outlier Detection 


The Hadi robust outlier detection identifies specific cases as outliers (if there are any) 
and then uses the acceptable cases to compute the requested measure in the usual way. 
The following are the steps for this procedure: 


■ Compute a "robust" covariance matrix by finding the median (instead of the mean)
  for each variable and using Σ(x_i - median)² in the calculation of each covariance.
  If the resulting matrix is singular, reconstruct another after inflating the smallest
  eigenvalues by a small amount.

■ Use this robust estimate of the covariance matrix to compute Mahalanobis
  distances and then use the distance to rank the cases.

■ Use the half of the sample with the lowest ranks to compute the usual covariance
  matrix (that is, deviations from the mean).

■ Use this covariance matrix to compute new distances for the complete sample and
  rerank the cases.

■ After ranking, select the same number of cases with small ranks as before but add
  the case with the next largest rank and repeat the process, each time updating the
  covariance matrix, computing and sorting new distances, and increasing the
  subsample size by one.

■ Continue adding cases until the entering one exceeds an internal limit based on a
  chi-square statistic (see Hadi, 1994). The cases remaining (not entered) are
  identified as outliers.

■ Use the cases that are not identified as outliers to compute the measure requested
  in the usual way.
Simple Correlations in SYSTAT


Simple Correlations Dialog Box 


To open the Simple Correlations dialog box, from the menus choose: 


Analyze 
Correlations 
Simple...


[Figure: Simple Correlations dialog box, Main tab, showing the available variable list,
Types (Continuous data, Distance measures, Rank order data, Unordered data, Binary
data), Deletion (Listwise, Pairwise), and Save matrix]


The following options are available:


Selected variable(s). Available only if One is selected for Sets. All selected variables 
are correlated with all other variables in the list, producing a triangular correlation 
matrix. 


Row(s). Available only if Two is selected for Sets. Selected variables are correlated 
with all column variables, producing a rectangular matrix. 


Column(s). Available only if Two is selected for Sets. Selected variables are correlated 
with all row variables, producing a rectangular matrix. 


Sets. One set creates a single, triangular correlation matrix of all variables in the 
Selected Variable(s) list. Two sets creates a rectangular matrix of variables in the 
Row(s) list correlated with variables in the Column(s) list. 


Listwise. Listwise deletion of missing data. Any case with missing data for any 
variable in the list is excluded. 


Pairwise. Pairwise deletion of missing data. Only cases with missing data for one of the
variables in the pair being correlated are excluded.


Save matrix. Saves the correlation matrix to a file. 


Types. Type of data or measure. You can select from a variety of distance measures, as 
well as measures for continuous data, rank-order data, binary data and unordered data. 
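The difference between the two deletion methods is easiest to see on a small example. The Python sketch below is illustrative only: the data are made up, `None` stands in for a missing value, and the function names are hypothetical rather than part of SYSTAT.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

data = {
    "A": [1.0, 2.0, 3.0, 4.0, None, 6.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0, 6.0],
    "C": [1.0, None, 2.0, 2.0, 3.0, 5.0],
}

def complete_rows(columns):
    """Keep only the cases in which every supplied column is present."""
    return [row for row in zip(*columns) if all(v is not None for v in row)]

def corr_listwise(data, u, v):
    """Drop a case if ANY variable in the whole list is missing."""
    names = list(data)
    rows = complete_rows(data.values())
    iu, iv = names.index(u), names.index(v)
    return pearson([r[iu] for r in rows], [r[iv] for r in rows]), len(rows)

def corr_pairwise(data, u, v):
    """Drop a case only if u or v itself is missing."""
    rows = complete_rows([data[u], data[v]])
    return pearson([r[0] for r in rows], [r[1] for r in rows]), len(rows)

r_list, n_list = corr_listwise(data, "A", "B")
r_pair, n_pair = corr_pairwise(data, "A", "B")
print(f"listwise: r={r_list:.4f} (n={n_list})  pairwise: r={r_pair:.4f} (n={n_pair})")
```

Pairwise deletion uses more of the data for each coefficient, but because different cells of the matrix are then based on different cases, the resulting matrix is not guaranteed to be positive semidefinite.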


Measures for Continuous Data 


The following measures are available for continuous data: 


■ Pearson. Produces a matrix of Pearson product-moment correlation coefficients.
  Pearson correlations vary between -1 and +1. A value of 0 indicates that neither of
  two variables can be predicted from the other by using a linear equation. A Pearson
  correlation of +1 or -1 indicates that one variable can be predicted perfectly by a
  linear function of the other.

■ Covariance. Produces a covariance matrix.

■ SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen,
  sums are weighted by N/n, where n is the count for a pair, and N is the number of
  cases.


The Pearson, Covariance, and SSCP measures are related. The entries in an SSCP
matrix are sums of squares of deviations (from the mean) and sums of cross-products of
deviations. If you divide each entry by (n - 1), variances result from the sums of
squares and covariances from the sums of cross-products. Divide each covariance by
the product of the standard deviations (of the two variables) and the result is a
correlation.
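The chain from SSCP to covariance to correlation can be traced for a single pair of variables. The Python sketch below uses made-up data and a hypothetical function name; it follows the textbook relationships described above and is not taken from SYSTAT.

```python
import math

def sscp_cov_corr(x, y):
    """Return the SSCP entries, the (co)variances, and the correlation for two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_x = sum((a - mx) ** 2 for a in x)                 # sum of squares of deviations
    ss_y = sum((b - my) ** 2 for b in y)
    scp = sum((a - mx) * (b - my) for a, b in zip(x, y)) # sum of cross-products
    var_x, var_y, cov = ss_x / (n - 1), ss_y / (n - 1), scp / (n - 1)
    r = cov / (math.sqrt(var_x) * math.sqrt(var_y))
    return (ss_x, ss_y, scp), (var_x, var_y, cov), r

sscp, covariances, r = sscp_cov_corr([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(sscp, covariances, round(r, 4))
```

The divisor (n - 1) cancels in the last step, so the correlation can equally be computed from the SSCP entries alone.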


Distance and Dissimilarity Measures 


Correlations offers two dissimilarity measures and two distance measures: 


Bray-Curtis. Produces a matrix of dissimilarity measures for continuous data. 


QSK. Produces a matrix of symmetric dissimilarity coefficients. Also called the
Kulczynski measure.


Euclidean. Produces a matrix of Euclidean distances normalized by the sample 
size. 


City. Produces a matrix of “city-block,” or first-power, distances (sum of absolute 
discrepancies) normalized by the sample size. 


Measures for Rank Order Data 


If your data are simply ranks of attributes, or if you want to see how well variables are 
associated when you pay attention to rank ordering, you should consider the following 
measures available for ranked data: 


Spearman. Produces a matrix of Spearman rank-order correlation coefficients. 
This measure is a nonparametric version of the Pearson correlation coefficient, 
based on the ranks of the data rather than the actual values. 


Gamma. Produces a matrix of Goodman-Kruskal's gamma coefficients. 
MU2. Produces a matrix of Guttman's mu2 monotonicity coefficients. 
Tau b. Produces a matrix of Kendall’s tau b rank-order coefficients.


Tau c. Produces a matrix of Stuart’s tau c coefficients.


Measures for Unordered Data 


If your data are categorical, you should consider the following measures available for 
unordered data: 


Phi. Produces a matrix of Phi coefficients. 
Cramer’s V. Produces a matrix of Cramer’s V coefficients.
Contingency. Produces a matrix of contingency coefficients. 


Lambda. Produces a matrix of symmetric Goodman-Kruskal’s lambda 
coefficients. 


Uncertainty. Produces a matrix of symmetric uncertainty coefficients. 


Measures for Binary Data 


These coefficients relate variables assuming only two values. The dichotomy
coefficients work only for dichotomous data scored as (0 and 1) or (1 and 2).


The following measures are available for binary data: 


Positive matching (S2). Produces a matrix of positive matching dichotomy
coefficients.

Jaccard (S3). Produces a matrix of Jaccard’s dichotomy coefficients.

Simple matching (S4). Produces a matrix of simple matching dichotomy
coefficients.

Anderberg (S5). Produces a matrix of Anderberg’s dichotomy coefficients.


Tanimoto (S6). Produces a matrix of Tanimoto’s dichotomy coefficients. 
Tetra. Produces a matrix of tetrachoric correlations. 

Yule’s Q. Produces a matrix of Yule’s Q coefficients. 

Hamman. Produces a matrix of Hamman’s binary coefficients. 

Dice. Produces a matrix of Dice’s binary coefficients.

Anderberg (S7). Produces a matrix of Anderberg's binary coefficients.

Sneath. Produces a matrix of Sneath's binary coefficients.

Ochiai. Produces a matrix of Ochiai's binary coefficients.

Kulczynski. Produces a matrix of Kulczynski’s binary coefficients. 
Gower2. Produces a matrix of Gower2’s binary coefficients. 
Options


To specify options for correlations, click the Options tab in the Simple Correlations
dialog box.


[Figure: Simple Correlations dialog box, Options tab, showing Probabilities (Bonferroni,
Dunn-Sidak, Uncorrected), EM estimation (Normal, Contaminated normal, t, Iterations,
Convergence), and Hadi outlier identification and estimation (Tolerance)]


The following options are available for continuous data: 


Probabilities. Requests the probability of each correlation coefficient to test that the
correlation is 0. Uncorrected probabilities are appropriate if you select only one
correlation coefficient to test. Bonferroni and Dunn-Sidak use adjusted probabilities.
Available only for Pearson product-moment correlations.
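The two adjustments can be illustrated with their usual textbook forms. In the Python sketch below the function names are hypothetical, and it is an assumption (not stated in this section) that the number of tests m is the number of distinct correlations in the matrix, k(k - 1)/2 for k variables.

```python
def bonferroni(p, m):
    """Bonferroni-adjusted probability for one of m tests: min(1, m * p)."""
    return min(1.0, m * p)

def dunn_sidak(p, m):
    """Dunn-Sidak-adjusted probability for one of m tests: 1 - (1 - p)^m."""
    return 1.0 - (1.0 - p) ** m

k = 5                          # for example, the five performance variables
m = k * (k - 1) // 2           # 10 distinct correlations in the triangular matrix
p_uncorrected = 0.01
p_bonf = bonferroni(p_uncorrected, m)
p_sidak = dunn_sidak(p_uncorrected, m)
print(f"{p_bonf:.4f} {p_sidak:.4f}")
```

The Dunn-Sidak value is never larger than the Bonferroni value, so it is slightly less conservative; for small probabilities the two are nearly identical.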


EM estimation. Requests the EM algorithm to estimate Pearson correlation,
covariance, or SSCP matrices from data with missing values. Little’s MCAR test is
displayed with a graphical display of the pattern of missing values. If your data set does
not contain outliers, use the Normal option. For robust estimates where outliers are
downweighted, select Contaminated Normal or t.


■ Contaminated Normal produces maximum likelihood estimates for a contaminated
  multivariate normal sample. For the contaminated normal, SYSTAT assumes that
  the distribution is a mixture of two normal distributions (same mean, different
  variances) with a specified probability of contamination. The Probability value is
  the probability of contamination (for example, 0.10), and Variance is the variance
  of contamination. Downweighting for the normal model tends to be concentrated
  in a few outlying cases.

■ t produces maximum likelihood estimates for a t distribution, where df is the
  degrees of freedom. Downweighting for the multivariate t model tends to be more
  spread out than for the normal model. The degree of downweighting is inversely
  related to the degrees of freedom.


Iterations. Specifies the maximum number of iterations for computing the estimates. 


Convergence. Defines the convergence criterion. If the relative change of covariance
entries is less than the specified value, convergence is assumed.


Hadi outlier identification and estimation. Requests the Hadi multivariate outlier 
detection algorithm to identify outliers and to compute the correlation, covariance, or 
SSCP matrix from the remaining cases. Tolerance omits variables with a multiple R- 
square value greater than (1 — n), where n is the specified tolerance value. 
Resampling


Click on the Resampling tab to specify different resampling options.


[Figure: Simple Correlations dialog box, Resampling tab, showing Perform resampling,
Method, Number of samples, Sample size, Random seed, and Confidence]

Perform resampling. Generates samples of cases and uses data thereof to carry out the
same analysis on each sample.


Method. Three sampling methods are available:

■ Bootstrap. Generates bootstrap samples. This is the default method.

■ Without replacement. Generates subsamples without replacement.

■ Jackknife. Generates jackknife samples.


Number of samples. Specify the number of samples to be generated. These samples are 
analyzed using the chosen method of sampling. The default is 1. 


Sample size. Specify the size of each sample to be generated while resampling. The 
default sample size is the number of cases in the data file in use. 


Random seed. Specify a random seed to be used while resampling. The default random 
seed is generated by the system. 


Confidence. Specify a confidence level for the bootstrap-based confidence interval. Enter 
any value between 0 and 1. The default is 0.95. 
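The resampling options above can be pictured with a small sketch. The Python below is not SYSTAT code: `pearson` and `bootstrap_ci` are hypothetical helper names, the data are made up, and the percentile interval is an assumption, since the text does not say how SYSTAT forms its bootstrap-based interval. The arguments mirror the dialog fields (number of samples, sample size, random seed, confidence).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_samples=200, sample_size=None, seed=None, confidence=0.95):
    """Percentile bootstrap interval for the Pearson correlation (assumed method)."""
    rng = random.Random(seed)            # fixed seed makes the run repeatable
    n = len(x)
    m = sample_size or n                 # default: as many cases as the data file has
    stats = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(m)]   # draw cases with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue                     # skip degenerate samples with zero variance
        stats.append(pearson(xs, ys))
    stats.sort()
    alpha = 1.0 - confidence
    lo = stats[int(math.floor((alpha / 2) * (len(stats) - 1)))]
    hi = stats[int(math.ceil((1 - alpha / 2) * (len(stats) - 1)))]
    return lo, hi


# Made-up illustrative data: a nearly linear relation.
x = list(range(1, 21))
y = [2 * v + ((v * 7) % 5) for v in x]
lo, hi = bootstrap_ci(x, y, n_samples=200, seed=1, confidence=0.95)
```

Each bootstrap sample repeats the same analysis on cases drawn with replacement; the interval is then read off the sorted replicate correlations.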


Using Commands 


CORR 
USE filename 
PLENGTH SHORT or MEDIUM/LONG 
SAVE filename/TYPE=RECTANGULAR or SSCP or COVARIANCE or, 
CORRELATION or DISSIMILARITY or SIMILARITY 
For one set: 


PEARSON or COVARIANCE or SSCP varlist / BONF or DUNN or PROB, 
EM or EM NORMAL = n1, n2 or EM T=n or EM T=n ITER=n, 
CONV=n or HADI or HADI TOL=n 


BC or QSK or EUCLIDEAN or CITY varlist/LISTWISE or PAIRWISE 
SPEARMAN or GAMMA or MU2 or TAUB or TAUC varlist/LISTWISE or, 
PAIRWISE 


TETRA or S2 or S3 or S4 or S5 or S6 or YULEQ or HAMMAN or DICE, 
or S7 or SNEATH or OCHIAI or KULCZY or GOWER varlist/LISTWISE or, 
PAIRWISE 


PHI or CRAMER or CONT or UNCE or LAMBDA varlist/ LISTWISE or, 
PAIRWISE 

For two sets: 


PEARSON or COVARIANCE or SSCP rowlist*collist / BONF or DUNN or, 
PROB EM or EM NORMAL = n1, n2 or EM T=n or EM T=n ITER=n, 
CONV=n or HADI or HADI TOL=n 


BC or QSK or EUCLIDEAN or CITY rowlist*collist /LISTWISE or, 
PAIRWISE 


SPEARMAN or GAMMA or MU2 or TAUB or TAUC rowlist*collist, 
/LISTWISE or PAIRWISE 


TETRA or S2 or S3 or S4 or S5 or S6 or YULEQ or HAMMAN or DICE, 
or S7 or SNEATH or OCHIAI or KULCZY or GOWER, 
rowlist*collist/LISTWISE or PAIRWISE 
PHI or CRAMER or CONT or UNCE or LAMBDA rowlist*collist/ LISTWISE, 
or PAIRWISE 

You can also use resampling procedures with the following options: 


SAMPLE = BOOT(m,n) or SIMPLE(m,n) or JACK 


To get the summarized resampling output, give the following command before the hot 
command that indicates the type of correlation: 

SAMPLE BOOT(m,n) or SIMPLE(m,n) or JACK / CONFI=c 


You can have the summarized results of resampling for PEARSON, SPEARMAN, 
GAMMA, TAUB, and MU2 correlation coefficients. 


Usage Considerations 


Types of data. CORR uses rectangular data only. 


Print options. With PLENGTH LONG, SYSTAT prints the mean of each variable for 
measures of continuous data. In addition, for EM estimation, SYSTAT prints an 
iteration history, missing value patterns, Little's MCAR test, and mean estimates. 


Quick Graphs. For measures of continuous data, distance and dissimilarity measures, 
rank ordered data, and binary data, CORR includes a SPLOM (matrix of scatterplots) 
where the data in each plot correspond to a value in the matrix. 


Saving files. CORR saves the correlation matrix or any other measure computed. 
SYSTAT automatically defines the type of file as CORR, DISS, COVA, SSCP, SIMI, or 
RECT. 


BY groups. CORR analyzes data by groups. Your file need not be sorted on the BY 
variable(s). 

Case frequencies. FREQUENCY variable increases the number of cases by the value of 
the FREQUENCY variable. 


Case weights. WEIGHT is available in CORR. 


Examples 


Example 1 
Pearson Correlations 


This example uses data from the OURWORLD file that contains records (cases) for 57 
countries. We are interested in correlations among variables recording the percentage 
of the population living in cities, birth rate, gross domestic product per capita, dollars 
expended per person for the military, ratio of birth rates to death rates, life expectancy 
(in years) for males and females, percentage of the population who can read, and gross 
national product per capita in 1986. 


The input is: 


CORR 

USE OURWORLD 

PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 


The output is: 


         | GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF 
---------+---------------------------------------------- 
GDP_CAP  |   1.000 
MIL      |   0.899   1.000 
B_TO_D   |  -0.659  -0.607   1.000 
LIFEEXPM |   0.664   0.582  -0.211     1.000 
LIFEEXPF |   0.704   0.619  -0.265     0.989     1.000 
LITERACY |   0.637   0.562  -0.274     0.911     0.935 
GNP_86   |   0.964   0.873  -0.560     0.633     0.665 


Scatter Plot Matrix 


[Scatterplot matrix of the nine variables, with histograms of the variables on the diagonal] 


The correlations for all pairs of the nine variables are shown here. The bottom of the 
output panel shows that the sample size is 49, but the data file has 57 countries. If a 
country has one or more missing values, SYSTAT, by default, omits all of the data for 
the case. This is called listwise deletion. 

The Quick Graph is a matrix of scatterplots with one plot for each entry in the 
correlation matrix and histograms of the variables on the diagonal. For example, the 
plot of BIRTH_RT against URBAN is at the top left under the histogram for URBAN. 

If linearity does not hold for your variables, your results may be meaningless. A 
good way to assess linearity, the presence of outliers, and other anomalies is to 
examine the plot for each pair of variables in the scatterplot matrix. The relationships 
between GDP_CAP and BIRTH_RT, B_TO_D, LIFEEXPM, and LIFEEXPF do not 
appear to be linear. Also, the points in the MIL versus GDP_CAP and GNP_86 versus 
MIL plots form clumps in the lower left corner. It is not wise to use correlations for 
describing these relations. 


Requesting a Portion of a Matrix 


You can request that only a portion of the matrix be computed. 


The input is: 


CORR 
USE OURWORLD 
FORMAT 
PEARSON lifeexpm lifeexpf literacy gnp_86 *, 
        urban birth_rt gdp_cap mil b_to_d 


The output is: 
Number of Observations: 49 
Pearson Correlation Matrix 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D 
---------+-------------------------------------------- 
LIFEEXPM |  0.776    -0.922    0.664   0.582  -0.211 
LIFEEXPF |  0.801    -0.949    0.704   0.619  -0.265 
LITERACY |  0.800    -0.930    0.637   0.562  -0.274 
GNP_86   |  0.592    -0.689    0.964   0.873  -0.560 


These correlations correspond to the lower left corner of the first matrix. 


Example 2 
Transformations 


If relationships between variables appear nonlinear, using a measure of linear 
association is not advised. Fortunately, transformations of the variables may yield 
linear relationships. You can then use the linear relation measures, but all conclusions 
regarding the relationships are relative to the transformed variables instead of the 
original variables. 
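A small numerical sketch of this point, in Python rather than SYSTAT and with made-up data: when y grows exponentially in x, the Pearson correlation of the raw values understates the relationship, while the correlation with log10(y) is essentially 1. The helper name `pearson` is hypothetical, and `math.log10` plays the role that L10 plays in the SYSTAT examples.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y is an exact exponential function of x.
x = list(range(1, 11))
y = [10 ** (0.3 * v) for v in x]

r_raw = pearson(x, y)                              # curved relation, r below 1
r_log = pearson(x, [math.log10(v) for v in y])     # linear after the transformation
```

Any conclusion drawn from `r_log` is a statement about log10(y), not about y itself.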

In the Pearson correlations example, we observed nonlinear relationships involving 
GDP_CAP, MIL, and GNP_86. Here we log transform these variables and compare the 
resulting correlations to those for the untransformed variables. 


The input is: 


CORR 
USE OURWORLD 
LET (gdp_cap,mil,gnp_86) = L10(@) 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 


Notice that we use SYSTAT's shortcut notation to make the transformation. 
Alternatively, you could use: 


LET gdp_cap = L10(gdp_cap) 
LET mil = L10(mil) 
LET gnp_86 = L10(gnp_86) 


The output is: 


Number of Observations: 49 
Means 


   URBAN  BIRTH_RT  GDP_CAP    MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY  GNP_86 
  52.878    25.959    3.370  1.695   2.885    65.429    70.571    74.727   3.279 


Pearson Correlation Matrix 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF 
---------+---------------------------------------------------------------- 
URBAN    |  1.000 
BIRTH_RT | -0.800     1.000 
GDP_CAP  |  0.764    -0.919    1.000 
MIL      |  0.680    -0.801    0.895   1.000 
B_TO_D   | -0.307     0.511   -0.529  -0.537   1.000 
LIFEEXPM |  0.776    -0.922    0.860   0.727  -0.211     1.000 
LIFEEXPF |  0.801    -0.949    0.895   0.763  -0.265     0.989     1.000 
LITERACY |  0.800    -0.930    0.834   0.714  -0.274     0.911     0.935 
GNP_86   |  0.775    -0.879    0.974   0.877  -0.441     0.861     0.886 


Pearson Correlation Matrix (contd...) 


         | LITERACY  GNP_86 
---------+------------------ 
LITERACY |    1.000 
GNP_86   |    0.840   1.000 




Chapter 6 


Scatter Plot Matrix 


[Scatterplot matrix of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D, LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86] 


In the scatterplot matrix, linearity has improved in the plots involving GDP_CAP, MIL, 
and GNP_86. Look at the difference between the correlations before and after 
transformation. 


              gdp_cap vs.          mil vs.           gnp_86 vs. 
            Transformation     Transformation     Transformation 
              no      yes        no      yes        no      yes 
urban       0.625    0.764     0.597    0.680     0.592    0.775 
birth_rt   -0.762   -0.919    -0.672   -0.801    -0.689   -0.879 
lifeexpm    0.664    0.860     0.582    0.727     0.633    0.861 
lifeexpf    0.704    0.895     0.619    0.763     0.665    0.886 
literacy    0.637    0.834     0.562    0.714     0.61     0.840 


After log transforming the variables, linearity has improved in the plots, and many of 
the correlations are stronger. 


Correlations, Associations, and Distance Measures 


Example 3 
Missing Data: Pairwise Deletion 


To specify pairwise deletion, the input is: 


LET (gdp_cap,mil,gnp_86) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 / PAIR 


Pearson Correlation Matrix 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF 
---------+---------------------------------------------------------------- 
URBAN    |  1.000 
BIRTH_RT | -0.781     1.000 
GDP_CAP  |  0.778    -0.895    1.000 
MIL      |  0.683    -0.687    0.857   1.000 
B_TO_D   | -0.248     0.535   -0.472  -0.377   1.000 
LIFEEXPM |  0.796    -0.892    0.854   0.696  -0.172     1.000 
LIFEEXPF |  0.816    -0.924    0.891   0.721  -0.230     0.989     1.000 
LITERACY |  0.807    -0.930    0.832   0.646  -0.291     0.911     0.937 
GNP_86   |  0.775    -0.881    0.974   0.881  -0.455     0.863     0.888 


Pearson Correlation Matrix (contd...) 


         | LITERACY  GNP_86 
---------+------------------ 
LITERACY |    1.000 
GNP_86   |    0.842   1.000 


Pairwise Frequency Table 


         |  URBAN  BIRTH_RT  GDP_CAP  MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY 
---------+----------------------------------------------------------------------- 
URBAN    |     56 
BIRTH_RT |     56        57 
GDP_CAP  |     56        57       57 
MIL      |     55        56       56   56 
B_TO_D   |     56        57       57   56      57 
LIFEEXPM |     56        57       57   56      57        57 
LIFEEXPF |     56        57       57   56      57        57        57 
LITERACY |     56        57       57   56      57        57        57        57 
GNP_86   |     49        50       50   50      50        50        50        50 


Pairwise Frequency Table (contd...) 


         | GNP_86 
---------+-------- 
GNP_86   |     50 


The sample size for each variable is reported as the diagonal of the pairwise frequency 
table; sample sizes for complete pairs of cases are reported off the diagonal. There are 
57 countries in this sample: 56 reported the percentage living in cities (URBAN), and 
50 reported the gross national product per capita in 1986 (GNP_86). There are 49 
countries that have values for both URBAN and GNP_86. 

The means are printed because we specified PLENGTH LONG. Since pairwise 
deletion is requested, all available values are used to compute each mean; that is, 
these means are the same as those computed by the Statistics procedure. 
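The difference between the two deletion rules comes down to which cases are counted. The sketch below is Python, not SYSTAT, and uses a tiny made-up data set in which `None` marks a missing value; `listwise_n` and `pairwise_n` are hypothetical helper names. It reproduces the logic of a pairwise frequency table: the diagonal counts cases present on one variable, the off-diagonal counts cases present on both.

```python
def listwise_n(rows):
    """Number of cases with no missing value on any variable."""
    return sum(all(v is not None for v in row) for row in rows)


def pairwise_n(rows, i, j):
    """Number of cases with nonmissing values on both variable i and variable j."""
    return sum(row[i] is not None and row[j] is not None for row in rows)


# Made-up data: five cases, three variables, None = missing.
rows = [
    (1.0, 2.0, 3.0),
    (None, 1.0, 2.0),
    (2.0, None, 1.0),
    (3.0, 4.0, None),
    (4.0, 5.0, 6.0),
]

n_listwise = listwise_n(rows)          # only the complete cases
# Lower-triangular frequency table, one row per variable.
freq_table = [[pairwise_n(rows, i, j) for j in range(i + 1)] for i in range(3)]
```

Listwise deletion keeps only two of the five cases here, whereas every pair of variables still has three usable cases under pairwise deletion.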


Example 4 
Missing Data: EM Estimation 


This example uses the same variables used in the transformations example. 


The input is: 


CORR 
USE OURWORLD 
LET (gdp_cap,mil,gnp_86) = L10(@) 
IDVAR country$ 
GRAPH NONE 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm, 
        lifeexpf literacy gnp_86 / EM 


The output is: 


EM Algorithm 

Iteration  Maximum Error       -2*LL 
     1         1.092       24135.483 
     2         1.024        7625.491 
     3         0.643        6932.605 
     4         0.666        6691.459 
     5         0.858        6573.200 
     6         2.718        6538.853 
     7         0.728        6531.690 
     8         0.197        6530.369 
     9         0.078        6530.167 
    10         0.035        6530.160 
    11         0.016        6530.176 
    12         0.008        6530.190 
    13         0.004        6530.199 
    14         0.002        6530.204 
    15         0.001        6530.207 
    16         0.001        6530.209 


No. of Cases   Missing Value Patterns 
               (X=nonmissing; .=missing) 
     49        XXXXXXXXX 
      1        .XXXXXXXX 
      6        XXXXXXXX. 
      1        XXX.XXXX. 


Little MCAR Test Statistic : 35.757 
df                         : 23 
p-value                    : 0.044 


EM Estimate of Means 

   URBAN  BIRTH_RT  GDP_CAP    MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY  GNP_86 
  53.152    26.351    3.372  1.154   2.873    65.088    70.123    73.563   3.284 


EM Estimated Correlation Matrix 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF 
---------+---------------------------------------------------------------- 
URBAN    |  1.000 
BIRTH_RT | -0.782     1.000 
GDP_CAP  |  0.779    -0.895    1.000 
MIL      |  0.700    -0.697    0.863   1.000 
B_TO_D   | -0.259     0.535   -0.472  -0.357   1.000 
LIFEEXPM |  0.796    -0.892    0.854   0.713               1.000 
LIFEEXPF |  0.816    -0.924    0.891   0.738               0.989     1.000 
LITERACY |  0.808    -0.930    0.832   0.668               0.911     0.937 
GNP_86   |  0.796    -0.831    0.968   0.874  -0.342     0.863     0.885 


EM Estimated Correlation Matrix (contd...) 


         | LITERACY  GNP_86 
---------+------------------ 
LITERACY |    1.000 
GNP_86   |    0.828   1.000 


SYSTAT prints missing-value patterns for the data. Forty-nine cases in the sample are 
complete (an X is printed for each of the nine variables). Periods are inserted where 
data are missing. The value of the first variable, URBAN, is missing for one case, while 
the value of the last variable, GNP_86, is missing for six cases. The last row of the 
pattern indicates that the values of the fourth variable, MIL, and the last variable, 
GNP_86, are both missing for one case. 

Little's MCAR (missing completely at random) test has a probability less than 0.05, 
indicating that we reject the hypothesis that the nine missing values are randomly 
missing. This test has limited power when the sample of incomplete cases is small and 
it also offers no direct evidence on the validity of the MCAR assumption. 


Example 5 
Probabilities Associated with Correlations 


To request the usual (uncorrected) probabilities for a correlation matrix using pairwise 
deletion, the input is: 


USE OURWORLD 
CORR 
LET (gdp_cap,mil,gnp_86) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 / PAIR PROB 


The output is: 


Bartlett Chi-square Statistic : 815.067 
p-value                       : 0.000 


Matrix of Probabilities 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY 
---------+-------------------------------------------------------------------------- 
URBAN    |  0.000 
BIRTH_RT |  0.000     0.000 
GDP_CAP  |  0.000     0.000    0.000 
MIL      |  0.000     0.000    0.000   0.000 
B_TO_D   |  0.065     0.000    0.000   0.004   0.000 
LIFEEXPM |  0.000     0.000    0.000   0.000   0.202     0.000 
LIFEEXPF |  0.000     0.000    0.000   0.000   0.085     0.000     0.000 
LITERACY |  0.000     0.000    0.000   0.000   0.028     0.000     0.000     0.000 
GNP_86   |  0.000     0.000    0.000   0.000   0.001     0.000     0.000     0.000 


Matrix of Probabilities (contd...) 


         | GNP_86 
---------+-------- 
GNP_86   |  0.000 


The p-values that are appropriate for making statements regarding one specific 
correlation are shown here. By themselves, these values are not very informative. 
These p-values are pseudo-probabilities because they do not reflect the number of 
correlations being tested. If pairwise deletion is used, the problem is even worse, 
although many statistical packages print probabilities as if they meant something in 
this case, too. 

SYSTAT computes the Bartlett chi-square test whenever you request probabilities 
for more than one correlation. This tests a global hypothesis concerning the 
significance of all of the correlations in the matrix 


$$\chi^2 = -\left[ N - 1 - \frac{2p + 5}{6} \right] \ln |R|$$

where N is the total sample size (or the smallest sample size for any pair in the matrix 
if pairwise deletion is used), p is the number of variables, and |R| is the determinant of 
the correlation matrix. This test is sensitive to non-normality, and the test statistic is 
only asymptotically distributed (for large samples) as chi-square. Nevertheless, it can 
serve as a guideline. 
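As a worked illustration, the sketch below evaluates the Bartlett statistic in its standard form, chi-square = -[N - 1 - (2p + 5)/6] ln|R|, with p(p - 1)/2 degrees of freedom (36 for the nine variables used here). It is Python, not SYSTAT; `determinant` and `bartlett_chi_square` are hypothetical helper names, and the small correlation matrix is made up.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0                       # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                       # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_chi_square(R, n_cases):
    """Return (statistic, df) for a p-by-p correlation matrix R and sample size N."""
    p = len(R)
    stat = -(n_cases - 1 - (2 * p + 5) / 6.0) * math.log(determinant(R))
    df = p * (p - 1) // 2
    return stat, df


# Made-up example: two variables correlated 0.5, observed on 50 cases.
stat, df = bartlett_chi_square([[1.0, 0.5], [0.5, 1.0]], 50)
```

When all correlations are zero, |R| = 1 and the statistic is 0; the stronger the correlations, the smaller |R| and the larger the statistic.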

If the Bartlett test is not significant, do not even look at the significance of individual 
correlations. In this example, the test is significant, which indicates that there may be 
some real correlations among the variables. The Bartlett test is sensitive to 
non-normality and can be used only as a guide. Even if the Bartlett test is significant, you 
cannot accept the nominal p values as the true family probabilities associated with each 
correlation. 


Bonferroni Probabilities with Pairwise Deletion 


Let us now examine the probabilities adjusted by the Bonferroni method that provides 
protection for multiple tests. Remember that the log-transformed values from the 
transformations example are still in effect. 


The input is: 


USE OURWORLD 
CORR 
LET (gdp_cap,mil,gnp_86) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 / PAIR BONF 


The output is: 
Bartlett Chi-square Statistic : 815.067 
df                            : 36 
p-value                       : 0.000 


Matrix of Bonferroni Probabilities 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY 
---------+-------------------------------------------------------------------------- 
URBAN    |  0.000 
BIRTH_RT |  0.000     0.000 
GDP_CAP  |  0.000     0.000    0.000 
MIL      |  0.000     0.000    0.000   0.000 
B_TO_D   |  1.000     0.001    0.008   0.150   0.000 
LIFEEXPM |  0.000     0.000    0.000   0.000   1.000     0.000 
LIFEEXPF |  0.000     0.000    0.000   0.000   1.000     0.000     0.000 
LITERACY |  0.000     0.000    0.000   0.000   1.000     0.000     0.000     0.000 
GNP_86   |  0.000     0.000    0.000   0.000   0.032     0.000     0.000     0.000 


Matrix of Bonferroni Probabilities (contd...) 


         | GNP_86 
---------+-------- 
GNP_86   |  0.000 


Compare these results with those for the 36 tests using uncorrected probabilities. 
Notice that some correlations, such as those for B_TO_D with MIL and LITERACY, are 
no longer significant at the 0.05 level, and the probability for B_TO_D with GNP_86 
rises from 0.001 to 0.032. 
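The Bonferroni adjustment itself is simple arithmetic: each uncorrected probability is multiplied by the number of tests and capped at 1. The sketch below is Python, not SYSTAT, and `bonferroni_adjust` is a hypothetical helper name. The inputs are the three-decimal probabilities printed earlier for the 36 pairwise tests, so results can differ slightly from SYSTAT's printed values, which are presumably computed from unrounded probabilities.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted probability: p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


n_tests = 36  # nine variables give 9 * 8 / 2 = 36 correlations

# Uncorrected probabilities as printed (rounded to three decimals).
adj_urban = bonferroni_adjust(0.065, n_tests)      # B_TO_D with URBAN
adj_literacy = bonferroni_adjust(0.028, n_tests)   # B_TO_D with LITERACY
adj_mil = bonferroni_adjust(0.004, n_tests)        # B_TO_D with MIL
adj_gnp = bonferroni_adjust(0.001, n_tests)        # B_TO_D with GNP_86
```

The cap at 1 explains the entries of 1.000 in the Bonferroni matrix: any uncorrected probability above 1/36 (about 0.028) is pushed to the ceiling.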


Bonferroni Probabilities for EM Estimates 
You can request the Bonferroni adjusted probabilities for an EM estimated matrix. 


The input is: 


USE OURWORLD 
CORR 
LET (gdp_cap,mil,gnp_86) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf, 
        literacy gnp_86 / EM BONF 


The output is: 
Bartlett Chi-square Statistic : 821.288 
df                            : 36 
p-value                       : 0.000 


Matrix of Bonferroni Probabilities 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF  LITERACY 
---------+-------------------------------------------------------------------------- 
URBAN    |  0.000 
BIRTH_RT |  0.000     0.000 
GDP_CAP  |  0.000     0.000    0.000 
MIL      |  0.000     0.000    0.000   0.000 
B_TO_D   |  1.000     0.001    0.008   0.248   0.000 
LIFEEXPM |  0.000     0.000    0.000   0.000   1.000     0.000 
LIFEEXPF |  0.000     0.000    0.000   0.000   1.000     0.000     0.000 
LITERACY |  0.000     0.000    0.000   0.000   1.000     0.000     0.000     0.000 
GNP_86   |  0.000     0.000    0.000   0.000   0.537     0.000     0.000     0.000 


Example 6 
Hadi Robust Outlier Detection 


If only one or two variables have outliers among many well behaved variables, the 
outliers may be masked. Let us look for outliers among four variables. 


The input is: 


USE OURWORLD 
CORR 
LET (gdp_cap, mil) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
IDVAR country$ 
PEARSON gdp_cap mil b_to_d literacy / HADI 
PLOT GDP_CAP*B_TO_D*LITERACY / SPIKE XGRID YGRID AXES=BOOK, 
     SCALE=L SYMBOL=GROUP$ SIZE=1.250,1.250,1.250 


The output is: 
Number of Observations: 56 


These 15 outliers are identified 


Case Distance 
Venezuela 4.487 
CostaRica 4.553 
Senegal 4.666 
Sudan 4.749 
Ethiopia 4.820 
Pakistan 5.058 
Libya 5.103 
Haiti 5.449 
Bangladesh 5.480 
Yemen 5.840 
Gambia 5.842 
Iraq 5.845 
Guinea 6.123 
Somalia 6.185 
Mali 6.301 


Means of Variables of Non-Outlying Cases 

 GDP_CAP     MIL  B_TO_D  LITERACY 
   3.634   1.967   2.533    88.183 


HADI Estimated Correlation Matrix 


         | GDP_CAP     MIL  B_TO_D  LITERACY 
---------+------------------------------------ 
GDP_CAP  |   1.000 
MIL      |   0.860   1.000 
B_TO_D   |  -0.839  -0.753   1.000 
LITERACY |   0.729   0.642  -0.698     1.000 


Fifteen countries are identified as outliers. We suspect that the sample may not be 
homogeneous, so we request a plot labeled by GROUP$. The panel is set to PLENGTH 
LONG; the country names appear because we specified COUNTRY$ as an ID variable. 
The correlations at the end of the output are computed using the 41 cases (56 
observations less the 15 outliers) that are not identified as outliers. 

In the plot, we see that Islamic countries tend to fall between New World and 
European countries with respect to birth-to-death ratio and have the lowest literacy. 
European countries have the highest literacy and GDP_CAP values. 


Stratifying the Analysis 


We will use Hadi for each of the three groups separately. 

The input is: 


USE OURWORLD 
CORR 
LET (gdp_cap, mil) = L10(@) 
GRAPH NONE 
PLENGTH LONG 
IDVAR country$ 
BY group$ 
PEARSON gdp_cap mil b_to_d literacy / HADI 
BY 


For clarity, we edited the output by moving the panels of means to the end: 


The output is: 


The following results are for: GROUP$ = Europe 
Number of Observations: 20 
The 1 outlier(s) identified 

Case       Distance 
Portugal 


HADI Estimated Correlation Matrix 


         | GDP_CAP     MIL  B_TO_D  LITERACY 
---------+------------------------------------ 
GDP_CAP  |   1.000 
MIL      |   0.474   1.000 
B_TO_D   |  -0.092  -0.173   1.000 
LITERACY |   0.259   0.263   0.136     1.000 


The following results are for: GROUP$ = Islamic 
Number of Observations: 15 
HADI Estimated Correlation Matrix 


         | GDP_CAP     MIL  B_TO_D  LITERACY 
---------+------------------------------------ 
GDP_CAP  |   1.000 
MIL      |   0.877   1.000 
B_TO_D   |   0.781   0.882   1.000 
LITERACY |   0.600   0.605   0.649     1.000 


The following results are for: GROUP$ = NewWorld 
Number of Observations: 21 
HADI Estimated Correlation Matrix 


         | GDP_CAP     MIL  B_TO_D  LITERACY 
---------+------------------------------------ 
GDP_CAP  |   1.000 
MIL      |   0.674   1.000 
B_TO_D   |  -0.246  -0.287   1.000 
LITERACY |   0.689   0.561  -0.045     1.000 


Means of Variables of Non-Outlying Cases (Europe) 
 GDP_CAP     MIL  B_TO_D  LITERACY 
   4.059   2.404   1.260    98.316 

Means of Variables of Non-Outlying Cases (Islamic) 
 GDP_CAP     MIL  B_TO_D  LITERACY 
   2.164   1.400   3.547    36.733 

Means of Variables of Non-Outlying Cases (NewWorld) 
 GDP_CAP     MIL  B_TO_D  LITERACY 
   3.214   1.466   3.951    79.957 


When computations are done separately for each group, Portugal is the only outlier, 
and the within-groups correlations differ markedly from group to group and from those 
for the complete sample. By scanning the means, we also see that the centroids for the 
three groups are quite different. 


Example 7 
Spearman Correlations 


As an example, we request Spearman correlations for the same data used in the Pearson 
correlation and Transformations examples. It is often useful to compute both a 
Spearman and a Pearson matrix using the same data. The absolute difference between 
the two can reveal unusual features such as outliers and highly skewed distributions. 


The input is: 


USE OURWORLD 
CORR 
GRAPH NONE 
SPEARMAN urban birth_rt gdp_cap mil b_to_d, 
         lifeexpm lifeexpf literacy gnp_86 / PAIR 


The output is: 


Spearman Correlation Matrix 


         |  URBAN  BIRTH_RT  GDP_CAP     MIL  B_TO_D  LIFEEXPM  LIFEEXPF 
---------+---------------------------------------------------------------- 
MIL      |  0.678    -0.670    0.848   1.000 
B_TO_D   | -0.381     0.689   -0.597  -0.498   1.000 
LIFEEXPM |  0.731    -0.856    0.834   0.633  -0.410     1.000 
LIFEEXPF |  0.771    -0.902    0.910   0.709  -0.501     0.965     1.000 
LITERACY |  0.760    -0.868    0.882   0.696  -0.576     0.813     0.866 
GNP_86   |  0.767    -0.847    0.973   0.867  -0.543     0.834     0.901 


         | LITERACY  GNP_86 
---------+------------------ 
GNP_86   |    0.909   1.000 


Note that many of these correlations are closer to the Pearson correlations for the log- 
transformed data than they are to the correlations for the raw data. 


Example 8 
S2 and S3 Coefficients 


The choice among the binary S measures depends on what you want to state about your 
variables. In this example, we request S2 and S3 to study responses made by 256 
subjects to a depression inventory (Afifi, Clark and May, 2003). These data are stored 
in the SURVEY2 data file that has one record for each respondent with answers to 20 
questions about depression. Each subject was asked, for example, "Last week, did you 
cry less than 1 day (code 0), 1 to 2 days (code 1), 3 to 4 days (code 2), or 5 to 7 days 
(code 3)?" The distributions of the answers appear to be Poisson, so they are not 
satisfactory for Pearson correlations. Here we dichotomize the behaviors or feelings as 
"Did it occur or did it not?" by using transformations of the form: 


LET blue = blue <> 0 


The result is true (1) when the behavior or feeling is present or false (0) when it is 
absent. We use SYSTAT's shortcut notation to do this for 7 of the 20 questions. For 
each pair of feelings or behaviors, S2 indicates the proportion of subjects with both, 
and S3 indicates the proportion of times both occurred given that one occurs. 


The input is: 


USE SURVEY2 
CORR 
LET (blue,depress,cry,sad,no_eat,getgoing,talkless) = @ <> 0 
GRAPH NONE 
S2 blue depress cry sad no_eat getgoing talkless 
S3 blue depress cry sad no_eat getgoing talkless 


The output is: 


S2 (Russell and Rao) Binary Similarity Coefficients 


Number of Observations: 256 


         |  BLUE  DEPRESS    CRY    SAD  NO_EAT  GETGOING  TALKLESS 
---------+----------------------------------------------------------- 
BLUE     | 0.254 
DEPRESS  | 0.207    0.422 
CRY      | 0.090    0.113  0.133 
SAD      | 0.188    0.313  0.117  0.391 
NO_EAT   | 0.117    0.129  0.051  0.137   0.246 
GETGOING | 0.180    0.309  0.086  0.258   0.152     0.520 
TALKLESS | 0.117    0.156  0.059  0.145   0.098     0.172     0.246 


S3 (Jaccard) Binary Similarity Coefficients 


Number of Observations: 256 


         |  BLUE  DEPRESS    CRY    SAD  NO_EAT  GETGOING  TALKLESS 
---------+----------------------------------------------------------- 
BLUE     | 1.000 
DEPRESS  | 0.442    1.000 
CRY      | 0.303    0.257  1.000 
SAD      | 0.410    0.625  0.288  1.000 
NO_EAT   | 0.306    0.239  0.155  0.273   1.000 
GETGOING | 0.303    0.488  0.152  0.395   0.248     1.000 
TALKLESS | 0.306    0.305  0.183  0.294   0.248     0.289     1.000 


                 Sad 
               1      0 
Depress   1   80     28 
          0   20    128 

For S2, the result is 80/256 = 0.313; for S3, 80/128 = 0.625. 
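The arithmetic can be checked with a short sketch in Python (not SYSTAT). For a 2 x 2 table with a = both present, b and c = only one present, and d = both absent, S2 (Russell and Rao) is a/(a + b + c + d) and S3 (Jaccard) is a/(a + b + c). The function name `s2_s3` is hypothetical. The counts a = 80 and d = 128 follow from the two fractions quoted above; the split of the remaining 48 subjects into b = 28 and c = 20 is taken from the marginal proportions in the S2 matrix (0.422 x 256 = 108 for DEPRESS, 0.391 x 256 = 100 for SAD) and should be read as a reconstruction.

```python
def s2_s3(a, b, c, d):
    """Return (S2, S3) for a 2 x 2 table of binary co-occurrence counts.

    a: both present, b and c: exactly one present, d: both absent.
    """
    n = a + b + c + d
    s2 = a / n                 # Russell and Rao: share of all subjects with both
    s3 = a / (a + b + c)       # Jaccard: share with both among those with at least one
    return s2, s3


# DEPRESS by SAD: 80 with both, 28 + 20 with exactly one, 128 with neither.
s2, s3 = s2_s3(80, 28, 20, 128)
```

Because S3 ignores the 128 subjects who reported neither feeling, it is always at least as large as S2.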




Example 9 
Tetrachoric Correlation 


As an example, we use the bivariate normal data in the SYSTAT data file named 
TETRA. 


The input is: 


USE TETRA 
FREQUENCY COUNT 
CORR 

TETRA x y 


The output is: 
Number of Observations: 45 


Tetrachoric Correlations 


[Tetrachoric correlation matrix for X and Y] 


For our single pair of variables, the tetrachoric correlation is 0.81. 


Example 10 
Unordered Data 


This example uses counts from a breast cancer study of 72 women. The study was done 
in different cities (CENTER$). The type of the tumor (TUMOR$), age of the patients 
(AGE), and their survival (SURVIVE$) are recorded. The variable NUMBER indicates 
the frequency of patients. We may wish to find out whether there is any association 
between the type of tumor and age. 


The input is: 


USE CANCER 

CORR 

FREQUENCY NUMBER 
CRAMER TUMOR$ AGE 




The output is: 
Number of Cases : 72 


Cramer Coefficient Matrix 


       | TUMOR$    AGE 
-------+---------------- 
TUMOR$ |  1.000 
AGE    |  0.055  1.000 


Number of observations: 72 


In the table above, Cramer's V would attain 1 for complete association. Here, the 
value of 0.055 indicates only a partial, and weak, association between age and the 
type of tumor. 


Computation 


Algorithms 


The computational algorithms use provisional means, sum of squares, and cross- 
products (Spicer, 1972). Starting values for the EM algorithm use all available values 
(see Little and Rubin, 2002, p. 42). 

For the rank-order coefficients (Gamma, Mu2, Spearman, Tau b, and Tau c), keep 
in mind that these are time consuming. Spearman requires sorting and ranking the data 
before doing the same work done by Pearson. The Gamma and Mu2 items require 
computations between all possible pairs of observations. Thus, their computing time is 
combinatoric. 


Missing Data 


If you have missing data, CORR can handle them in three ways: listwise deletion, 
pairwise deletion, and EM estimation. Listwise deletion is the default. If there are 
missing data and pairwise deletion is used, SYSTAT displays a table of frequencies 
between all possible pairs of variables after the correlation matrix. 

Pairwise deletion takes considerably more computer time because the sum of 
cross-products for each pair must be saved in a temporary disk file. If you use the pairwise 
deletion to compute an SSCP matrix, the sum of squares and cross-products are 
weighted by N/n, where N is the number of cases in the whole file and n is the number 
of cases with nonmissing values in a given pair. 

See "Missing Value Analysis" on page 123 in Statistics III for a complete 
discussion of handling missing values. 


Chapter 


Correspondence Analysis 


Leland Wilkinson (revised by Mousum Dutta) 


Correspondence analysis allows you to examine the relationship between categorical 
variables graphically. It computes simple and multiple correspondence analysis for 
two-way and multiway tables of categorical variables, respectively. Tables are 
decomposed into row and column coordinates, which are displayed in a graph. 
Categories that are similar to each other appear close to each other in the graphs. 

Correspondence Analysis can use either a case-by-variables data format or a two- 
way frequency table format with the first column indicating row labels; we call the 
latter Smart Correspondence Analysis. 

Resampling procedures are available in this feature. 


Statistical Background 


Correspondence analysis is a method for decomposing a table of data into row and 
column coordinates that can be displayed graphically. With this technique, a two-way 
table can be represented in a two-dimensional graph with points for rows and 
columns. These coordinates are computed with a Singular Value Decomposition 
(SVD), which factors a matrix into the product of three matrices: a collection of left 
singular vectors, a matrix of singular values, and a collection of right singular 
vectors. Greenacre (1984) is the most comprehensive reference. Hill (1974) and 
Jobson (1992) cover the major topics more briefly. 


The Simple Model 


The simple correspondence analysis model decomposes a two-way table. This 
decomposition begins with a matrix of standardized deviates, computed for each cell 
in the table as follows: 

$$z_{ij} = \frac{1}{\sqrt{N}} \cdot \frac{n_{ij} - e_{ij}}{\sqrt{e_{ij}}}$$

where N is the sum of the table counts n_ij over all cells, n_ij is the observed count or 
frequency for cell ij, and e_ij is the expected count for cell ij based on an independence 
model. The second term in this equation is a cell's contribution to the χ² 
test-for-independence statistic. Thus, the sum of the squared z_ij over all cells in the table is the 
same as χ²/N. Finally, the row mass for row i is n_i./N and the column mass for 
column j is n_.j/N. 
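A minimal numerical sketch of this first step, in Python rather than SYSTAT and on a made-up 2 x 2 table. It assumes the standardized deviate has the form z_ij = (n_ij - e_ij) / sqrt(e_ij) / sqrt(N), which is the form implied by the statement that the squared deviates sum to the chi-square statistic divided by N; the function name `standardized_deviates` is hypothetical.

```python
import math


def standardized_deviates(table):
    """Return (z, row_mass, col_mass) for a two-way table of counts."""
    n = sum(sum(row) for row in table)                 # N: total count
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    z = []
    for i, row in enumerate(table):
        z_row = []
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n         # expected count under independence
            z_row.append((n_ij - e_ij) / math.sqrt(e_ij) / math.sqrt(n))
        z.append(z_row)
    row_mass = [t / n for t in row_tot]                # n_i. / N
    col_mass = [t / n for t in col_tot]                # n_.j / N
    return z, row_mass, col_mass


# Made-up table: every expected count is 15 and every deviation is +5 or -5.
table = [[20, 10], [10, 20]]
z, row_mass, col_mass = standardized_deviates(table)
total_inertia = sum(v * v for row in z for v in row)   # equals chi-square / N
```

For this table the chi-square statistic is 4 x 25/15 = 6.667 and N = 60, so the total inertia is 1/9.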

The next step is to compute the matrix of cross-products from this matrix of 
deviates: 


S = Z'Z 


This S matrix has t = min(r-1, c-1) nonzero eigenvalues, where r and c are the 
row and column dimensions of the original table, respectively. The sum of these 
eigenvalues is χ²/N (which is termed total inertia). It is this matrix that is 
decomposed as follows: 


S = UDV' 


where U is a matrix of row vectors, V is a matrix of column vectors, and D is a diagonal 
matrix of the eigenvalues. The coordinates actually plotted are standardized from U 
(for rows), so that 


$$\frac{\chi^2}{N} = \sum_{i=1}^{r} \frac{n_{i\cdot}}{N} \sum_{j=1}^{t} x_{ij}^2$$


The coordinates are similarly standardized from V (for columns). 


The Multiple Model 


The multiple correspondence model decomposes higher-way tables. Suppose we have 
a multiway table of dimension kı by К, by К; by.... The multiple model begins with 


an п by p matrix Z of dummy-coded profiles, where n = the total number of cases in 


the table and p = Kı + kə +k; + .... This matrix is used to create a cross-products 
matrix: 

6-72 
which is rescaled and decomposed with a singular value decompos 
Jobson (1992) for further information. 


ition, as before. See 
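The dummy-coded profile matrix can be illustrated with a short Python sketch. The five cases below are hypothetical and serve only to show the shape of Z and of the cross-products matrix; the rescaling and decomposition steps are not shown. `indicator_matrix` is an illustrative name, not a SYSTAT function.

```python
# Hypothetical cases: one (injury, driver, seatbelt) profile per case.
cases = [
    ("None", "Normal", "Yes"),
    ("None", "Normal", "No"),
    ("Minor", "Drunk", "No"),
    ("None", "Drunk", "Yes"),
    ("Major", "Drunk", "No"),
]

def indicator_matrix(rows):
    """Dummy-code every variable; return (Z, labels) with one column per category."""
    n_vars = len(rows[0])
    levels = [sorted({r[k] for r in rows}) for k in range(n_vars)]
    labels = [(k, level) for k in range(n_vars) for level in levels[k]]
    z = [[1 if r[k] == level else 0 for (k, level) in labels] for r in rows]
    return z, labels

Z, labels = indicator_matrix(cases)
n, p = len(Z), len(labels)            # p = k1 + k2 + k3

# Cross-products matrix Z'Z: category counts on the diagonal,
# pairwise co-occurrence counts off the diagonal.
S = [[sum(row[a] * row[b] for row in Z) for b in range(p)] for a in range(p)]

print(n, p)
print([S[a][a] for a in range(p)])
```

Each row of Z contains exactly one 1 per variable, which is why p is the sum of the numbers of categories.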
Correspondence Analysis in SYSTAT 


Correspondence Analysis Dialog Box 


To open the Correspondence Analysis dialog box, from the menus choose: 


Analyze 
Tables 
Correspondence Analysis... 




Model selection and estimation are available in the Model tab of the Correspondence 
Analysis dialog box. 


Dependent(s). Select the variable(s) you want to examine. The dependent variable(s) 
should be categorical. To analyze a two-way table (simple correspondence analysis), 
select a variable defining the rows. Selecting multiple dependent variables (and no 
independent variables) yields a multiple correspondence model. 


Independent. To analyze a two-way table, select a categorical variable defining the 
columns of the table. 


Save coordinates. Saves coordinates and labels to a data file. 
Smart Correspondence Analysis Dialog Box 


Smart Correspondence Analysis performs a simple correspondence analysis when 
the data set is in the form of a two-way frequency table with row names in the first 
column. This is a SYSTAT rectangular format file, but, in this context, the first variable 
is reserved for representing the rows and the other variables are columns. For example, in 
the LEISURE data set, the first variable ACTIVITY$ identifies the rows. A subset 
of rows can be chosen through the SELECT command. Resampling procedures are not 
available in this feature. SYSTAT does not produce any command script in the log file. 


To open the Smart Correspondence Analysis dialog box, from the menus choose: 


Analyze 
Tables 
Smart Correspondence Analysis... 




Available column(s). All variables except the first are listed here. 
Selected column(s). Select the column(s) that you want to examine. 


Save coordinates. Saves coordinates and labels to a data file. 
Using Commands 


First, specify your data with USE filename. For a simple correspondence analysis, 
continue with: 

CORAN 
MODEL depvar=indvar 
ESTIMATE / SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK 


For a multiple correspondence analysis: 


CORAN 
MODEL varlist 
ESTIMATE / SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK 


If data are aggregated and there is a variable in the file representing frequency of 
profiles, use FREQ to identify that variable. 


Usage Considerations 


Types of data. CORAN uses rectangular data only. 
Print options. The output is the same for all PLENGTH options. 


Quick Graphs. Quick Graphs produced by CORAN are correspondence plots for the 
simple or multiple models. 


Saving files. For simple correspondence analysis, CORAN saves the row variable 
coordinates in DIM(1)...DIM(N) and the column variable coordinates in 
FACTOR(1)...FACTOR(N), where the subscript indicates the dimension number. For 
multiple correspondence analysis, DIM(1)...DIM(N) contain the variable coordinates 
and FACTOR(1)...FACTOR(N) contain the case coordinates. Label information is 
saved to LABEL$. 


BY groups. CORAN analyzes data by groups. Your file need not be sorted on the BY 
variable(s). 


Case frequencies. The FREQ command increases the number of cases by the value of 
the FREQ variable. 


Case weights. WEIGHT is not available in CORAN. 
Examples 


The examples begin with a simple correspondence analysis of a two-way table from 
Greenacre (1984). 


Example 1 
Correspondence Analysis (Simple) 


Here we illustrate a simple correspondence analysis model. The data comprise a 
hypothetical smoking survey in a company (Greenacre, 1984). Notice that we use 
value labels to describe the categories in the output and plot. The FREQ command 
codes the cell frequencies. 


The input is: 


USE SMOKE 
LABEL STAFF / 1="Sr.Managers", 2="Jr.Managers", 3="Sr.Employees", 
              4="Jr.Employees", 5="Secretaries" 
LABEL SMOKE / 1="None", 2="Light", 3="Moderate", 4="Heavy" 
FREQ FREQ 
CORAN 
MODEL STAFF=SMOKE 
ESTIMATE 


The output is: 


Case frequencies determined by value of variable FREQ 
Categorical values encountered during processing are 

Variables        | Levels 
-----------------+---------------------------------- 
STAFF (5 levels) | 1.000  2.000  3.000  4.000  5.000 
SMOKE (4 levels) | 1.000  2.000  3.000  4.000 


Simple Correspondence Analysis 
Chi-square  : 16.442 
df          : 12.000 
Probability : 0.172 


Eigenvalues and Percent Inertia 

Eigenvalue   Percent   Cumulative Percent 
     0.000     0.485              100.000 

Sum : 0.085 (Total Inertia) 

Row Variable Coordinates 

Name         |  Mass  Quality  Inertia  Factor 1  Factor 2 
-------------+-------------------------------------------- 
Sr.Managers  | 0.057    0.893    0.003     0.066     0.194 
Jr.Managers  | 0.093    0.991    0.012    -0.259     0.243 
Sr.Employees | 0.264    1.000    0.038     0.381     0.011 
Jr.Employees | 0.456    1.000    0.026    -0.233    -0.058 
Secretaries  | 0.130    0.999    0.006     0.201    -0.079 


Row Variable Contributions to Factors 

Name         | Factor 1  Factor 2 
-------------+------------------- 
Sr.Managers  |    0.003     0.214 
Jr.Managers  |    0.084     0.551 
Sr.Employees |    0.512     0.003 
Jr.Employees |    0.331     0.152 
Secretaries  |    0.070     0.081 


Row Variable Squared Correlations with Factors 

Name         | Factor 1  Factor 2 
-------------+------------------- 
Sr.Managers  |    0.092     0.800 
Jr.Managers  |    0.526     0.465 
Sr.Employees |    0.999     0.001 
Jr.Employees |    0.942     0.058 
Secretaries  |    0.865     0.133 


Column Variable Coordinates 

Name     | Quality  Inertia  Factor 1  Factor 2 
---------+------------------------------------- 
None     |   1.000    0.049     0.393     0.030 
Light    |   0.984    0.007    -0.099    -0.141 
Moderate |   0.983    0.013    -0.196    -0.007 
Heavy    |   0.995    0.016    -0.294     0.198 


Column Variable Contributions to Factors 

Name     | Factor 1  Factor 2 
---------+------------------- 
None     |    0.654     0.029 
Light    |    0.031     0.463 
Moderate |    0.166     0.002 
Heavy    |    0.150     0.506 


Column Variable Squared Correlations with Factors 

Name     | Factor 1  Factor 2 
---------+------------------- 
None     |    0.994     0.006 
Light    |    0.327     0.657 
Moderate |    0.982     0.001 
Heavy    |    0.684     0.310 

Correspondence Plot 

[Figure: correspondence plot of the row and column points; horizontal axis DIM(1)] 

For the simple correspondence model, CORAN prints the basic statistics and 
eigenvalues of the decomposition. Next are the row and column coordinates, with 
mass, quality, and inertia values. Mass equals the marginal total divided by the grand 
total. Quality is a measure (between 0 and 1) of how well a row or column point is 
represented by the first two factors. It is a proportion-of-variance statistic. See 
Greenacre (1984) for further information. Inertia is a row’s (or column’s) contribution 
to the total inertia. Contributions to the factors and squared correlations with the factors 
are the last reported statistics. 
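The relationship between quality and the squared correlations can be checked directly against the printed values. The sketch below, in Python, copies the row statistics from the Example 1 output and confirms that quality for each row is the sum of its squared correlations with the first two factors, up to rounding in the third decimal. The tolerance of 0.0015 is an assumption made to allow for that rounding.

```python
# (squared correlation with Factor 1, with Factor 2, printed Quality),
# copied from the row statistics in the Example 1 output.
rows = {
    "Sr.Managers":  (0.092, 0.800, 0.893),
    "Jr.Managers":  (0.526, 0.465, 0.991),
    "Sr.Employees": (0.999, 0.001, 1.000),
    "Jr.Employees": (0.942, 0.058, 1.000),
    "Secretaries":  (0.865, 0.133, 0.999),
}

TOLERANCE = 0.0015   # assumed allowance for three-decimal rounding

checks = {}
for name, (r1, r2, quality) in rows.items():
    explained = r1 + r2            # proportion of the point's inertia in two dimensions
    checks[name] = abs(explained - quality) <= TOLERANCE
    print(f"{name:13s} {explained:.3f} {quality:.3f} {checks[name]}")
```

A point with quality near 1, such as Sr.Employees, is displayed almost without distortion in the two-dimensional plot; Sr.Managers is the least well represented row.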
Example 2 
Smart Correspondence Analysis with Row-by-Column Data 


This example illustrates how to perform a simple correspondence analysis with row- 
by-column data. The LEISURE data file stores a cross-classification of leisure 
activities by occupational status. The data set has been taken from 
Clausen (1998). The following is a list of the names of the activities and 
occupational statuses. The SYSTAT names are within parentheses. 


Activities                    Occupational Status 

Sports Events (Sports)        Manual (MANUAL) 
Cinema (Cinema)               Low Status Non Manual (LOWNM) 
Dance / Disco (Dance)         High Status Non Manual (HIGHNM) 
Cafe / Restaurant (Cafe)      Farmer (FARMER) 
Theater (Theater)             Student (STUDENT) 
Art Exhibition (Art)          Retired (RETIRED) 
Library (Library) 
Church Service (Church) 


Smart Correspondence Analysis is available only through dialogs. There is no script for 
this example because the feature does not generate any log. To perform this example, open 
the LEISURE data file. The first column (Activities$) represents the row names. Then 
open the Smart Correspondence Analysis dialog box and select all variables for 
analysis. 


The output is: 


Simple Correspondence Analysis 
Chi-square  : 8.271E+002 
df          : 45.000 
Probability : 0.000 


Eigenvalues and Percent Inertia 

Factor | Eigenvalue  Percent  Cumulative Percent 
-------+---------------------------------------- 
     1 |      0.036   57.211   57.211 
     2 |      0.020   32.243   89.454 
     3 |      0.006    9.393   98.84 
     4 |      0.001    1.105   99.95 
     5 |      0.000    0.048  100.00 

Sum : 0.062 (Total Inertia) 


Row Variable Coordinates 

Name      |  Mass 
----------+------ 
Art       | 0.082 
Cafe      | 0.195 
Church    | 0.103 
Cinema    | 0.119 
Classical | 0.038 
Dance     | 0.128 
Library   | 0.089 
Pop       | 0.061 
Sports    | 0.113 
Theater   | 0.072 


Row Variable Contributions to Factors 

Name      | Factor 1 
----------+--------- 
Art       |    0.022 
Cafe      |    0.079 
Church    |    0.354 
Cinema    |    0.170 
Classical |    0.043 
Dance     |    0.106 
Library   |    0.010 
Pop       |    0.141 
Sports    |    0.073 
Theater   |    0.003 


Row Variable Squared Correlations with Factors 

Name      | Factor 1 
----------+--------- 
Art       |    0.203 
Cafe      |    0.446 
Church    |    0.939 
Cinema    |    0.932 
Classical |    0.253 
Dance     |    0.551 
Library   |    0.075 
Pop       |    0.780 
Sports    |    0.617 
Theater   |    0.029 


Column Variable Coordinates 

Name    |  Mass  Quality 
--------+--------------- 
FARMER  |          0.691 
HIGHNM  |          0.880 
LOWNM   |          0.890 
MANUAL  | 0.152    0.963 
RETIRED | 0.180    0.987 
STUDENT |          0.600 


Column Variable Contributions to Factors 

Name    | Factor 1 
--------+--------- 
FARMER  |    0.005 
HIGHNM  |    0.001 
LOWNM   |    0.029 
MANUAL  |    0.103 
RETIRED |    0.714 
STUDENT |    0.149 


Column Variable Squared Correlations with Factors 

Name    | Factor 1  Factor 2 
--------+------------------- 
FARMER  |    0.054     0.637 
HIGHNM  |    0.003     0.877 
LOWNM   |    0.361     0.529 
MANUAL  |    0.265     0.698 
RETIRED |    0.964     0.023 
STUDENT |    0.594     0.006 


[Figure: correspondence plot of activities and occupational statuses; horizontal axis DIM(1)] 

From the coordinate positions in the above graph, we see that retired persons are 
associated with church service; high status non manual occupations are associated with 
art exhibitions, theater, and classical; students and low status non manual occupations 
are associated with pop and cinema; and manual workers and farmers are associated with 
sports, dancing, and cafe. 


Example 3 
Simple Correspondence Analysis using Raw Data 


In this example, we focus on using raw data in correspondence analysis. The CRIMERW 
data file stores case-by-case information about crimes in three different areas in 
Norway. The data set is given in Clausen (1998) in tabular form. 

The input is: 


USE CRIMERW 
CORAN 
MODEL PLACE$ = CRIME$ 
ESTIMATE 


The output is: 

Categorical values encountered during processing are 

Variables         | Levels 
------------------+--------------------------- 
PLACE$ (3 levels) | MidN  NorthN  Olso 
CRIME$ (3 levels) | Burglary  Fraud  Vandalism 


Simple Correspondence Analysis 
Chi-square  : 1.663E+003 
df          : 4.000 
Probability : 0.000 


Eigenvalues and Percent Inertia 

Eigenvalue   Percent   Cumulative Percent 
     0.025    12.561              100.000 

Sum : 0.203 (Total Inertia) 

Row Variable Coordinates 

Name   |  Mass  Quality  Inertia  Factor 1  Factor 2 
-------+-------------------------------------------- 
MidN   | 0.148    1.000    0.044    -0.420     0.348 
NorthN | 0.289    1.000    0.081    -0.506    -0.161 
Olso   | 0.563    1.000    0.077     0.371    -0.009 


Row Variable Contributions to Factors 

Name   | Factor 1  Factor 2 
-------+------------------- 
MidN   |    0.147     0.704 
NorthN |    0.417     0.294 
Olso   |    0.436     0.002 


Row Variable Squared Correlations with Factors 

Name   | Factor 1  Factor 2 
-------+------------------- 
MidN   |    0.593     0.407 
NorthN |    0.908     0.092 
Olso   |    0.999     0.001 


Column Variable Coordinates 

Name      |  Mass  Quality  Inertia  Factor 1  Factor 2 
----------+-------------------------------------------- 
Burglary  | 0.151    1.000    0.055    -0.512    -0.325 
Fraud     | 0.358    1.000    0.109     0.550    -0.046 
Vandalism | 0.491    1.000    0.038    -0.245     0.134 


Column Variable Contributions to Factors 

Name      | Factor 1  Factor 2 
----------+------------------- 
Burglary  |    0.223     0.627 
Fraud     |    0.612     0.030 
Vandalism |    0.166     0.344 


Column Variable Squared Correlations with Factors 

Name      | Factor 1  Factor 2 
----------+------------------- 
Burglary  |    0.712     0.288 
Fraud     |    0.993     0.007 
Vandalism |    0.770     0.230 


Correspondence Plot 

[Figure: correspondence plot of places and crimes; horizontal axis DIM(1)] 


Example 4 
Multiple Correspondence Analysis 


This example uses automobile accident data in Alberta, Canada, reprinted in Jobson 
(1992). The categories are ordered with the ORDER command so that the output will 
show them in increasing order of severity. The data are in tabular form, so we use the 
FREQ command. 

The input is: 


USE ACCIDENT 
FREQ FREQ 
ORDER INJURY$ / SORT="None","Minimal","Minor","Major" 
ORDER DRIVER$ / SORT="Normal","Drunk" 
ORDER SEATBELT$ / SORT="Yes","No" 
CORAN 
MODEL INJURY$,DRIVER$,SEATBELT$ 
ESTIMATE 


The output is: 


Case frequencies determined by value of variable FREQ 
Categorical values encountered during processing are 

Variables            | Levels 
---------------------+---------------------------- 
INJURY$ (4 levels)   | None  Minimal  Minor  Major 
DRIVER$ (2 levels)   | Normal  Drunk 
SEATBELT$ (2 levels) | Yes  No 


Multiple Correspondence Analysis 
Eigenvalues and Percent Inertia 

Eigenvalue   Percent   Cumulative Percent 
     0.325    19.499               81.890 
     0.302    18.110           1.000E+002 

Sum : 1.667 (Total Inertia) 


Variable Coordinates 

Name    |  Mass  Quality  Inertia  Factor 1  Factor 2 
--------+-------------------------------------------- 
None    | 0.303    0.351    0.031     0.189     0.008 
Minimal | 0.018    0.251    0.315    -1.523    -1.454 
Minor   | 0.012    0.552    0.322    -2.134     3.294 
Major   | 0.001    0.544    0.332    -3.962   -10.976 
Normal  | 0.313    0.496    0.020     0.179     0.014 
Drunk   | 0.020    0.496    0.313    -2.758    -0.211 
Yes     | 0.053    0.279    0.280     1.143    -0.402 
No      | 0.280    0.279    0.053    -0.217     0.076 


Variable Contributions to Factors 

Name    | Factor 1  Factor 2 
--------+------------------- 
None    |    0.029     0.000 
Minimal |    0.111     0.113 
Minor   |    0.141     0.375 
Major   |    0.056     0.478 
Normal  |    0.027     0.000 
Drunk   |    0.414     0.003 
Yes     |    0.187     0.026 
No      |    0.036     0.005 


Variable Squared Correlations with Factors 

Name    | Factor 1  Factor 2 
--------+------------------- 
None    |    0.350     0.001 
Minimal |    0.131     0.120 
Minor   |    0.163     0.389 
Major   |    0.063     0.481 
Normal  |    0.493     0.003 
Drunk   |    0.493     0.003 
Yes     |    0.249     0.031 
No      |    0.249     0.031 


Case Coordinates 

Case | Factor 1  Factor 2 
-----+------------------- 
  53 |   -2.184    -6.281 
  54 |   -3.788    -6.411 
  55 |    0.082     0.057 
  56 |   -1.521    -0.073 
  57 |   -0.853    -0.787 
  58 |   -2.456    -0.916 
  59 |   -1.186     1.953 
  60 |   -2.790     1.823 
  61 |   -2.184    -6.281 
  62 |   -3.788    -6.411 


This time, we get case coordinates instead of column coordinates. These are not 
included in the following Quick Graph because the focus of the graph is on the tabular 
variables and we don’t want to clutter the display. If you want to plot case coordinates, 
save the case coordinates. Cut and paste them into the Data Editor and plot them 
directly. 


Correspondence Plot 

[Figure: correspondence plot of the variable categories; horizontal axis DIM(1), vertical axis DIM(2)] 

The graph reveals a principal axis of major versus minor injuries. This axis is related 
to drunk driving and seat belt use. 
Computation 

Algorithms 


CORAN uses a singular value decomposition of the cross-products matrix computed 
from the data. 


Missing Data 


Cases with missing data are deleted from all analyses. 
Crosstabulation 
(One-Way, Two-Way, and Multiway) 


(Revised by Carl Bowman and Harshaprabha N Shetty) 


When variables are categorical, frequency tables (crosstabulations) provide useful 
summaries. For a report, you may need only the number or percentage of cases falling 
in specified categories or cross-classifications. At times, you may require a test of 
independence or a measure of association between two categorical variables. Or, you 
may want to model relationships among two or more categorical variables by fitting 
a loglinear model to the cell frequencies. 

Both XTAB and LOGLIN can make, analyze, and save frequency tables that are 
formed by categorical variables (or table factors). The values of the factors can be 
character or numeric. Both procedures form tables using data read from a cases-by- 
variables rectangular file or recorded as frequencies (for example, from a table in a 
report) with cell indices. In XTAB, you can request percentages of row totals, column 
totals, or the total sample size. 


XTAB provides four types of frequency tables: 


One-Way                Frequency counts, percentages, and confidence intervals on cell 
                       proportions for single table factors or categorical variables. 

Two-Way                Frequency counts, percentages, tests, and measures of association 
                       for the crosstabulation of two factors. 

Multiway: Tabulate     Frequency counts and percentages, tests, and measures of association 
                       for series of two-way tables and standardized tables stratified by 
                       all combinations of values of a third, fourth, etc., table factor. 

Multiway: Standardize  Standardized frequency counts and partial measures of association 
                       for the crosstabulation of two factors by controlling the effect of 
                       test factor(s). 


Resampling procedures are available in this feature. 
Statistical Background 


Tables report results as counts or the number of cases falling in specific categories or 
cross-classifications. Categories may be unordered (democrat, republican, and 
independent), ordered (low, medium, and high), or formed by defining intervals on a 
continuous variable like AGE (child, teen, adult, and elderly). 


Making Tables 


There are many formats for displaying tabular data. Let us examine basic layouts for 
counts and percentages. 


One-Way Tables 


Here is an example of a table showing the number of people of each gender surveyed 
about depression at UCLA in 1980. 

Values for SEX$ 

  Male  Female  Total 
   104     152    256 


The categorical variable producing this table is SEX$. Sometimes, you may define 
categories as intervals of a continuous variable. Here is an example showing the 256 
people broken down by age. 

18 to 30  30 to 45  46 to 60  Over 60  Total 
      85        74        64       33    256 


Two-Way Tables 


A crosstabulation is a table that displays one cell for every combination of values on 
two or more categorical variables. Here is a two-way table that crosses the gender and 
age distributions of the tables above. 

         | Female  Male  Total 
---------+-------------------- 
18 to 30 |     53    32     85 
30 to 45 |     44    30     74 
46 to 60 |     38    26     64 
Over 60  |     17    16     33 
---------+-------------------- 
Total    |    152   104    256 


This crosstabulation shows relationships between age and gender, which were invisible 
in the separate tables. Notice, for example, that the sample contains 
a large number of females below the age of 46. 


Standardizing Tables with Percentages 


As with other statistical procedures such as Correlation, it sometimes helps to have 
numbers standardized on a recognizable scale. Correlations vary between —1 and 1, for 
example. A convenient scale for table counts is percentage, which varies between 0 and 
100. 

   With tables, you must choose a facet on which to standardize: rows, columns, or 
the total count in the table. For example, if we want to look at the difference between 
the genders within age groups, we might standardize by rows. Here is that table: 

         |  Female     Male |   Total        N 
---------+------------------+----------------- 
18 to 30 |  62.353   37.647 | 100.000   85.000 
30 to 45 |  59.459   40.541 | 100.000   74.000 
46 to 60 |  59.375   40.625 | 100.000   64.000 
Over 60  |  51.515   48.485 | 100.000   33.000 
---------+------------------+----------------- 
Total    |  59.375   40.625 | 100.000 


Here we see that as age increases, the sample becomes more evenly dispersed across 
the two genders. 

   On the other hand, if we are interested in the overall distribution of age for 
each gender, we might standardize within columns: 


         |  Female     Male |   Total        N 
---------+------------------+----------------- 
18 to 30 |  34.868   30.769 |  33.203   85.000 
30 to 45 |  28.947   28.846 |  28.906   74.000 
46 to 60 |  25.000   25.000 |  25.000   64.000 
Over 60  |  11.184   15.385 |  12.891   33.000 
---------+------------------+----------------- 
Total    | 100.000  100.000 | 100.000 


For each gender, the oldest age group appears underrepresented. 
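Row and column standardization are simple enough to verify by hand or in code. The Python sketch below recomputes both kinds of percentages from the age-by-gender counts. It assumes, as the column percentages indicate, that the first count column is Female and the second is Male. The function names are illustrative only and are not XTAB options.

```python
# Age group -> (female, male) counts from the two-way table.
counts = {
    "18 to 30": (53, 32),
    "30 to 45": (44, 30),
    "46 to 60": (38, 26),
    "Over 60":  (17, 16),
}

def row_percents(table):
    """Percent of each row total falling in each column."""
    return {row: [100 * n / sum(cells) for n in cells]
            for row, cells in table.items()}

def column_percents(table):
    """Percent of each column total falling in each row."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {row: [100 * n / t for n, t in zip(cells, col_totals)]
            for row, cells in table.items()}

row_pct = row_percents(counts)
col_pct = column_percents(counts)
for age in counts:
    print(f"{age:9s}",
          [round(p, 3) for p in row_pct[age]],
          [round(p, 3) for p in col_pct[age]])
```

Row percentages answer "within this age group, what share is female?", while column percentages answer "within this gender, what share falls in this age group?".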


Significance Tests and Measures of Association 


After producing a table, you may want to consider a population model that accounts 
for the structure you see in the observed table. You should have a population in mind 
when you make such inferences. Many published statistical analyses of tables do not 
explicitly deal with the sampling problem. 


One-Way Tables 


A model for these data might be that the proportion of the males and females is equal 
in the population. The null hypothesis corresponding to the model is: 


$$ H_0: p_{males} = p_{females} $$


The sampling model for testing this hypothesis requires that a population contains 
equal numbers of males and females and that each member of the population has an 
equal chance of being chosen. After choosing each person, we identify the person as 
male or female. There is no other category possible and one person cannot fit under 
both categories (exhaustive and mutually exclusive). 

   There is an exact way to test our null hypothesis (called a permutation test). We 
can tally every possible sample of size 256 (including one with no females and one 
with no males). Then we can sort our samples into two piles: samples in which there 
are between 40.625% and 59.375% females and samples in which there are 
not. If the latter pile is extremely small relative to the former, we can reject the null 
hypothesis. 
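The tallying described above no longer needs to be tedious. As a rough Python sketch, the proportion of equally likely samples that are at least as unbalanced as the observed one (152 females and 104 males) can be counted with binomial coefficients. Treating "at least as far from 128 as observed" as the extreme pile is an assumption about where the boundary cases go; the variable names are illustrative.

```python
from math import comb

n = 256                      # sample size
observed_females = 152       # observed count (the other 104 are males)
deviation = abs(observed_females - n / 2)

# Every sequence of n male/female draws is equally likely when p = 0.5.
# Count the sequences whose number of females is at least as far from n/2
# as the observed count, in either direction.
extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= deviation)

p_exact = extreme / 2 ** n   # share of all 2**n equally likely sequences
print(p_exact)
```

Python integers have arbitrary precision, so the count and the division are carried out without overflow.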

   Needless to say, this would be a tedious undertaking, particularly on a 
microcomputer. Fortunately, there is an approximation using a continuous probability 
distribution that works quite well. First, we need to calculate the expected count of 
males and females, respectively, in a sample of size 256 if p is 0.5. This is 128, 
or half the sample N. Next, we subtract the observed counts from these expected 
counts, square the differences, and divide each by the expected count: 

$$ \chi^2 = \frac{(152 - 128)^2}{128} + \frac{(104 - 128)^2}{128} = 9 $$



If our assumptions about the population and the structure of the table are correct, then 
this statistic will be distributed as a mathematical chi-square variable. We can look up 
the area under the tail of the chi-square statistic, beyond the sample value that we 
calculate, and if this area is small (say, less than 0.05), we can reject the null 
hypothesis. 

To look up the value, we need a degrees of freedom (df) value. This is the number 
of independent values being added together to produce the chi-square. In our case, it is 
1, since the observed proportion of men is simply 1 minus the observed proportion of 
women. If there were three categories (men, women, other?), then the degrees of 
freedom would be 2. Anyway, if you look up the value 9 with one degree of freedom 
in your chi-square table, you will find that the probability of exceeding this value is 
extremely small. Thus, we reject our null hypothesis that the proportion of males 
equals the proportion of females in the population. 
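Instead of a printed chi-square table, the tail area can be computed directly. The sketch below, in Python, repeats the one-way calculation and uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail area is erfc(sqrt(x/2)). That shortcut applies only when df = 1; it is used here to avoid any external library.

```python
from math import erfc, sqrt

observed = {"Female": 152, "Male": 104}
n = sum(observed.values())
expected = n / len(observed)                 # 128 under equal proportions

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1                       # one free category

# Upper tail area of chi-square with 1 df: P(Z**2 > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))
print(chi_square, df, round(p_value, 4))
```

The tail area is well below 0.05, in line with the conclusion in the text.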

This chi-square approximation is good only for large samples. A popular rule of 
thumb is that the expected counts should be greater than 5, although they should be 
even greater if you want to be comfortable with your test. With our sample, the 
difference between the approximation and the exact result is negligible. For both, the 
probability is small. 

   Our hypothesis test has an associated confidence interval. You can use SYSTAT to 
compute this interval for the population proportions. Here is the result: 

95 Percent Approximate Confidence Intervals Scaled as Cell Percents 
Values for SEX$ 

            |   Male  Female 
------------+--------------- 
LOWER LIMIT | 33.613  52.064 
UPPER LIMIT | 47.687  66.150 

The lower limit for each gender is on the top; the upper limits are at the bottom. Notice 
that these two intervals do not overlap. 


Two-Way Tables 


The most familiar test available for two-way tables is the Pearson chi-square test for 
independence of table rows and columns. When the table has only two rows or two 
columns, the chi-square test is also a test for equality of proportions. The concept of 
interaction in a two-way frequency table is similar to the one in analysis of variance. It 
is easiest to see in an example. Schachter (1959) randomly assigned 30 subjects to one 
of two groups: High Anxiety (17 subjects), who were told that they would be 
experiencing painful shocks, and Low Anxiety (13 subjects), who were told that they 
would experience painless shocks. After the assignment, each subject was given the 
choice of waiting alone or with the other subjects. The following tables illustrate two 
possible outcomes of this study. 


                  No Interaction         Interaction 
                       WAIT                 WAIT 
                  Alone  Together      Alone  Together 

ANXIETY  High         8         9          5        12 
         Low          6         7          9         4 

Notice in the table on the left that the number choosing to wait together relative to those 
choosing to wait alone is similar for both High and Low Anxiety groups. In the table on 
the right, however, more of the High Anxiety group chose to wait together. 

   We are interpreting these numbers relatively, so we should compute row 
percentages to understand the differences better. Here are the same tables standardized 
by rows: 


                  No Interaction         Interaction 
                       WAIT                 WAIT 
                  Alone  Together      Alone  Together 

ANXIETY  High      47.1      52.9       29.4      70.6 
         Low       46.2      53.8       69.2      30.8 


Now we can see that the percentages are similar in the two rows in the table on the left 
(No Interaction) and quite different in the table on the right (Interaction). A simple 
graph reveals these differences even more strongly. In the following figure, the No 
Interaction row percentages are plotted on the left. 


[Figure: row percentages plotted against WAIT$, with separate lines for Low and High 
ANXIETY; No Interaction panel on the left, Interaction panel on the right] 


Notice that the lines cross in the Interaction plot, showing that the rows differ. There 
is almost complete overlap in the No Interaction plot. 

Now, in the one-way table example above, we tested the hypothesis that the cell 
proportions were equal in the population. We can test an analogous hypothesis in this 
context—that each of the four cells contains 25 percent of the population. The problem 
with this assumption is that we already know that Schachter randomly assigned more 
people to the High Anxiety group. In other words, we should take the row marginal 
percentages (or totals) as fixed when we determine what proportions to expect in the 
cells from a random model. 

Our No Interaction model is based on these fixed marginals. In fact, we can fix 
either the row or column margins to compute a No Interaction model because the total 
number of subjects is fixed at 30. You can verify that the row and column sums in the 
above tables are the same. 

Now we are ready to compute our chi-square test of interaction (often called a test 
of independence) in the two-way table by using the No Interaction counts as expected 
counts in our chi-square formula above. This time, our degrees of freedom are still 1 
because the marginal counts are fixed. If you know the marginal counts, then one cell 
count determines the remaining three. In general, the degrees of freedom for this test 
are (rows — 1) times (columns — 1). 

Here is the result of our chi-square test. The chi-square is 4.693, with a p-value of 
0.03. On this basis, we reject our No Interaction hypothesis. 


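ANXIETY$ (rows) by WAIT$ (columns) 

       | ALONE  TOGETHER  Total 
-------+----------------------- 
High   |     5        12     17 
Low    |     9         4     13 
-------+----------------------- 
Total  |    14        16     30 


Chi-square Tests of Association for ANXIETY$ and WAIT$ 

Test Statistic               | Value     df   p-value 
-----------------------------+----------------------- 
Pearson Chi-square           | 4.693  1.000     0.030 
Likelihood Ratio Chi-square  | 
McNemar Symmetry Chi-square  | 
Yates Corrected Chi-square   |        1.000     0.072 
Fisher Exact Test (two-tail) |                  0.063 
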

Actually, we cheated. The program computed the expected counts from the observed 
data. These are not exactly the ones we showed you in the No Interaction table. They 
differ by a rounding error in the first decimal place. You can compute them exactly. The 
popular method is to multiply the total row count by the total column count 
corresponding to a cell and divide by the total sample size. For the upper left cell, 
this would be 17*14/30 = 7.93. 
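The expected counts and the Pearson statistic for the Schachter table can be reproduced with a short Python sketch. The observed counts come from the ANXIETY$ by WAIT$ output; everything else is ordinary arithmetic. As before, the erfc shortcut for the tail area is valid only because this table has one degree of freedom.

```python
from math import erfc, sqrt

observed = [[5, 12],    # High Anxiety: alone, together
            [9, 4]]     # Low Anxiety:  alone, together

row_tot = [sum(r) for r in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Expected count = row total * column total / sample size.
expected = [[r * c / n for c in col_tot] for r in row_tot]

pearson = sum((o - e) ** 2 / e
              for obs_row, exp_row in zip(observed, expected)
              for o, e in zip(obs_row, exp_row))
df = (len(row_tot) - 1) * (len(col_tot) - 1)
p_value = erfc(sqrt(pearson / 2))            # upper tail, valid for df == 1 only

print([[round(e, 2) for e in row] for row in expected])
print(round(pearson, 3), df, round(p_value, 3))
```

The unrounded expected counts (7.93, 9.07, 6.07, 6.93) are the ones the program uses, which is why they differ slightly from the whole numbers shown in the No Interaction table.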

There is one other interesting problem with these data. The chi-square is only an 
approximation and it does not work well for small samples. Although these data meet 
the minimum expected count of 5, they are nevertheless problematic. Look at the 
Fisher’s exact test result in the output above. Like our permutation test above, which 
was so cumbersome for large data files, Fisher’s test counts all possible outcomes 
exactly, including the ones that produce interaction greater than what we observed. The 
Fisher exact test p-value is not significant (0.063). On this basis, we could not reject 
the null hypothesis of no interaction, or independence. 

Yates’ chi-square test in the output is an attempt to adjust the Pearson chi-square 
statistic for small samples. While it has come into disfavor for being unnecessarily 
conservative in many instances, nevertheless, the Yates p-value is consistent with 
Fisher’s in this case (0.072). The likelihood ratio chi-square is an alternative to the 
Pearson chi-square and is used as a test statistic for loglinear models. 
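Both small-sample alternatives mentioned here can be sketched in Python for the same 2 x 2 table. The two-tail Fisher p-value below sums the probabilities of all tables (with the observed margins) that are no more probable than the observed one; that is one common definition of "two-tail" and is an assumption about the convention behind the printed value. The Yates statistic subtracts n/2 from |ad - bc| before squaring.

```python
from math import comb, erfc, sqrt

a, b, c, d = 5, 12, 9, 4                  # High: alone, together; Low: alone, together
r1, r2, c1, c2 = a + b, c + d, a + c, b + d
n = r1 + r2

def weight(x):
    """Number of ways to get x in the upper-left cell with all margins fixed."""
    return comb(c1, x) * comb(c2, r1 - x)

lo, hi = max(0, r1 - c2), min(r1, c1)     # feasible range for the upper-left cell
observed_weight = weight(a)

# Integer comparison avoids floating-point ties between equally likely tables.
extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed_weight)
fisher_p = extreme / comb(n, r1)

yates = n * (abs(a * d - b * c) - n / 2) ** 2 / (r1 * r2 * c1 * c2)
yates_p = erfc(sqrt(yates / 2))           # upper tail, 1 degree of freedom

print(round(fisher_p, 3), round(yates, 3), round(yates_p, 3))
```

Both p-values land above 0.05, which is the disagreement with the uncorrected Pearson test that the text describes.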


Selecting a Test or Measure 


Other tests and measures are appropriate for specific table structures and also depend 
on whether or not the categories of the factor are ordered. We use 2 x 2 to denote a 
table with two rows and two columns, and r x c for a table with r rows and c columns. 
The Pearson and likelihood-ratio chi-square statistics apply to r x c tables— 
categories need not be ordered. 

The McNemar’s test of symmetry is used for r x r square tables (the number of 
rows equals the number of columns). This structure arises when the same subjects are 
measured twice as in a paired comparisons t test (say before and after an event) or when 
subjects are paired or matched (cases and controls). So the row and column categories 
are the same, but they are measured at different times or circumstances (like the paired 
t) or for different groups of subjects (cases and controls). This test ignores the counts 
along the diagonal of the table and tests whether the counts in cells above the diagonal 
differ from those below the diagonal. A significant result indicates a greater change in 
one direction than another. (The counts along the diagonal are for subjects who did not 
change.) 
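The symmetry test can be sketched in Python as follows. The statistic compares each cell above the diagonal with its mirror image below the diagonal and ignores the diagonal entirely. The before/after counts are hypothetical, and skipping pairs of cells that are both empty (and not counting them toward the degrees of freedom) is an assumption of this sketch rather than a documented SYSTAT rule.

```python
from math import erfc, sqrt

def symmetry_chi_square(table):
    """Return (statistic, df) for the test of symmetry in a square table."""
    r = len(table)
    stat, df = 0.0, 0
    for i in range(r):
        for j in range(i + 1, r):
            total = table[i][j] + table[j][i]
            if total > 0:                      # assumed handling of empty pairs
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df

# Hypothetical before (rows) by after (columns) classification of 50 subjects.
before_after = [[20, 5],
                [15, 10]]

stat, df = symmetry_chi_square(before_after)
p_value = erfc(sqrt(stat / 2))                 # upper tail, valid because df == 1
print(stat, df, round(p_value, 4))
```

In this made-up table, 15 subjects moved from the second category to the first while only 5 moved the other way, so the test flags a net change in one direction.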

The table structure for Cohen’s kappa looks like that of McNemar’s in that the row 
and column categories are the same. But here the focus shifts to the diagonal: Are the 
counts along the diagonal significantly greater than those expected by chance alone? 
Because each subject is classified or rated twice, kappa is a measure of interrater 
agreement. 
   Another difference between McNemar and kappa is that the former is a “test” with 
a chi-square statistic, degrees of freedom, and an associated p-value, while the latter is 
a measure. Its “size” is judged by using an asymptotic standard error to construct a 
t statistic (that is, measure divided by standard error) to test whether kappa differs from 
0. Values of kappa greater than 0.75 indicate strong agreement beyond chance, values 
between 0.40 and 0.75 indicate fair to good agreement, and values below 0.40 indicate 
poor agreement. 
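For reference, kappa compares the observed proportion of agreement on the diagonal with the proportion expected by chance from the margins: kappa = (p_o - p_e) / (1 - p_e). The Python sketch below applies this to a hypothetical two-rater table. The asymptotic standard error is not computed, and the cutoffs in `describe` are a simplified reading of the guideline given in the text.

```python
def cohen_kappa(table):
    """Kappa for a square table of rater 1 (rows) by rater 2 (columns) counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    p_chance = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

def describe(kappa):
    """Rough verbal label following the guideline in the text."""
    if kappa > 0.75:
        return "strong"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"

# Hypothetical ratings of 50 subjects by two raters.
ratings = [[22, 3],
           [7, 18]]

kappa = cohen_kappa(ratings)
print(round(kappa, 3), describe(kappa))
```

Here the raters agree on 40 of 50 subjects (0.80), chance agreement from the margins is 0.50, and kappa is therefore 0.60.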

Phi, Cramer's V, and the contingency coefficient are measures suitable for testing
independence of table factors, as you would with Pearson's chi-square. They are
designed for comparing results of r x c tables with different sample sizes. (Note that
the expected value of the Pearson chi-square is proportional to the total table size.)
The three measures are scaled differently, but all test the same null hypothesis. Use
the probability printed with the Pearson chi-square to test that these measures are
zero. For tables with two rows and two columns (a 2 x 2 table), phi and Cramer's V
are the same.
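
The three measures are rescalings of the same Pearson chi-square, which a short Python
sketch can make concrete. The function names and the example counts are hypothetical,
the textbook definitions are assumed, and every row and column total is assumed to be
positive.

```python
import math


def pearson_chi_square(table):
    """Return (chi_square, n) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square, n


def chi_square_measures(table):
    """Return (phi, cramers_v, contingency) using the usual definitions."""
    chi_square, n = pearson_chi_square(table)
    smaller_dim = min(len(table), len(table[0])) - 1
    phi = math.sqrt(chi_square / n)
    cramers_v = math.sqrt(chi_square / (n * smaller_dim))
    contingency = math.sqrt(chi_square / (chi_square + n))
    return phi, cramers_v, contingency


# Hypothetical 2 x 2 counts; phi and Cramer's V coincide for this shape.
phi, cramers_v, contingency = chi_square_measures([[40, 10],
                                                   [20, 30]])
```

Dividing by n removes the dependence on sample size, which is what allows tables of
different sizes to be compared.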

There are other nominal measures of association that indicate the proportional
reduction in error when values of one variable are used to predict values of the other
variable. They are Goodman-Kruskal's lambda and the uncertainty coefficient. Each is
available in a symmetric form and in two asymmetric forms (column dependent and
row dependent).
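
To make "proportional reduction in error" concrete, here is a minimal Python sketch of
the column dependent form of lambda under its textbook definition. The function name
and the counts are hypothetical, and the sketch assumes the column variable has more
than one occupied category (otherwise the denominator is zero).

```python
def lambda_column_dependent(table):
    """Proportional reduction in error when rows are used to predict columns."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    # Errors made by always guessing the most frequent column category.
    errors_without_rows = n - max(col_totals)
    # Errors made by guessing the most frequent column category within each row.
    errors_with_rows = n - sum(max(row) for row in table)
    return (errors_without_rows - errors_with_rows) / errors_without_rows


# Hypothetical counts in which knowing the row halves the prediction errors.
helpful = lambda_column_dependent([[30, 10],
                                   [10, 30]])
# Hypothetical counts in which knowing the row does not change the best guess.
unhelpful = lambda_column_dependent([[30, 10],
                                     [30, 10]])
```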

Five of the measures for two-way tables are appropriate only when both categorical
variables have ordered categories (always, sometimes, never or none, minimal,
moderate, severe). These are Goodman-Kruskal's gamma, Kendall's tau-b, Stuart's
tau-c, Spearman's rho, and Somers' d. The first three of these differ only in how ties
are treated; the fourth is like the usual Pearson correlation except that the rank order of
each value is used in the computations instead of the value itself. Somers' d is an
asymmetric measure, for which both column dependent and row dependent versions
are given.
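
The ordinal measures are built from counts of concordant and discordant pairs of
subjects. The Python sketch below counts those pairs and computes gamma, which ignores
ties altogether. The function names and counts are hypothetical, and the sketch
assumes the table has at least one concordant or discordant pair.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in a table with ordered categories."""
    rows, cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            # Pair each cell with every cell in a later (higher) row.
            for k in range(i + 1, rows):
                for m in range(cols):
                    if m > j:      # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:    # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return concordant, discordant


def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D); tied pairs are left out entirely."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


# Hypothetical counts with ordered row and column categories.
pairs = concordant_discordant([[2, 1],
                               [1, 2]])
gamma = goodman_kruskal_gamma([[2, 1],
                               [1, 2]])
```

Tau-b and tau-c use the same numerator, C - D, but divide by quantities that account
for tied pairs.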

For 2 x 2 tables, Fisher's exact test (if n < 50) and Yates' corrected chi-square are
also printed. When expected cell sizes are small in a 2 x 2 table (an expected value
less than 5), use Fisher's exact test as described above.
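
For readers who want to see what "exact" means here, the following Python sketch
enumerates every 2 x 2 table with the observed margins and sums the hypergeometric
probabilities. The function name is hypothetical, and the two-sided rule used (add up
all tables no more probable than the observed one) is one common convention; it is an
assumption that it matches the probability SYSTAT prints.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact probability for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in map(probability, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical small-sample table.
p_value = fisher_exact_two_sided([[3, 1],
                                  [1, 3]])
```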

In larger contingency tables, we do not want to see any expected values less than 1.0 
or more than 20% of the values less than 5. For large tables with too many small 
expected values, there is no remedy except to combine categories or possibly omit a 
category that has very few observations. 
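
The rule of thumb for larger tables can be checked mechanically. The Python sketch
below computes the expected counts under independence and applies the two conditions
stated above. The function names and the example tables are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def expected_values_adequate(table):
    """True if no expected value is below 1.0 and at most 20% are below 5."""
    cells = [value for row in expected_counts(table) for value in row]
    any_below_one = any(value < 1.0 for value in cells)
    share_below_five = sum(value < 5.0 for value in cells) / len(cells)
    return (not any_below_one) and share_below_five <= 0.20


# Hypothetical tables: the first is well filled, the second is sparse.
well_filled = expected_values_adequate([[20, 30],
                                        [30, 20]])
sparse = expected_values_adequate([[1, 1, 1],
                                   [1, 1, 50]])
```

When the check fails, the remedy described above still applies: combine categories or
omit a category with very few observations.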

Yule's Q and Yule's Y measure dominance in a 2 x 2 table. If either off-diagonal
cell is 0, both statistics are equal to 1 (otherwise they are less than 1). These statistics
are 0 if and only if the chi-square statistic is 0. Therefore, the null hypothesis that the
measure is 0 can be tested by the chi-square test.
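
Both statistics are simple functions of the cross-products ad and bc, as the following
Python sketch shows. The textbook definitions are assumed, the function name and
counts are hypothetical, and the sketch assumes ad + bc is greater than 0.

```python
import math


def yule_q_and_y(table):
    """Return (Q, Y) for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    q = (a * d - b * c) / (a * d + b * c)
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    y = (root_ad - root_bc) / (root_ad + root_bc)
    return q, y


# Hypothetical table with an empty off-diagonal cell: both statistics equal 1.
dominant = yule_q_and_y([[10, 0],
                         [5, 10]])
# Hypothetical table with equal cross-products: both statistics equal 0.
balanced = yule_q_and_y([[10, 10],
                         [10, 10]])
```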


Standardized tables. Standardized tables are two-way tables formed from multiway
tables. Test factor standardization is one application of standardization: it statistically
removes the effect of control (strata) variables so that the relationship between the
independent (row) and dependent (column) variables can be examined without the
influence of the control variable(s). This allows you to compare the original total
association of two variables with the association of the same two variables after
standardization, where the effects of the test factor(s) have been statistically
removed. The measures discussed above for two-way tables can also be applied to
standardized tables formed from multiway tables.


Crosstabulations in SYSTAT 


One-Way Frequency Tables Dialog Box 


To open the One-Way Frequency Tables dialog box, from the menus choose: 
Analyze 
One-Way Frequency Tables... 

[Dialog box: Analyze: One-Way Frequency Tables, with tabs Main, Cell Statistics, and
Resampling. The Main tab has the Available variable(s) and Selected variable(s) lists
and the options described below.]


The One-Way tables provide frequency distributions, counts, percentages, measures,
tests, etc., for single table factors or categorical variables.

■ Tables. Tables can include Frequency distribution, Counts, Percents, and Counts
and percents. Counts and percents gives the frequency counts along with their
percentages. Frequency distribution displays the output in a listing format with
a tabular display.

■ Measures. Measures include Pearson chi-square and confidence intervals. Pearson
chi-square tests the equality of the cell frequencies. This test examines whether all
categories are equally likely.

■ Include missing values. You can include a category for cases with missing data.
SYSTAT treats this category in the same fashion as the other categories.

■ Save table(s). Saves the table(s) created for the variable(s) in the Selected
variable(s) list to a data file.
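
The one-way Pearson chi-square described under Measures compares each cell count with
the count expected if all categories were equally likely. A minimal Python sketch
follows; the function name is hypothetical, and the counts 104 and 152 are the SEX
frequencies from the one-way example later in this chapter.

```python
def equal_frequency_chi_square(counts):
    """Return (chi_square, df) for the hypothesis that all categories are equally likely."""
    n = sum(counts)
    k = len(counts)
    expected = n / k  # the same expected count for every category
    chi_square = sum((count - expected) ** 2 / expected for count in counts)
    return chi_square, k - 1


# SEX counts from the survey example: 104 males and 152 females.
chi_square, df = equal_frequency_chi_square([104, 152])
```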


One-Way Frequency Tables Cell Statistics 


In addition to frequencies and percentages, you can also request various descriptive 
statistics for any given variable in your data. 


[Dialog box: Cell Statistics tab of Analyze: One-Way Frequency Tables, with check
boxes for Minimum, Maximum, Sum, Mean, SD, Range, and Variance.]


One or more of the following statistics for a chosen variable can be displayed for each 
cell in the table: 


Minimum. Displays the minimum value within each cell for the selected variable.
Maximum. Displays the maximum value within each cell for the selected variable.
Sum. Displays the sum within each cell for the selected variable.
Mean. Displays the cell means for the selected variable.
SD. Displays the standard deviation within each cell for the selected variable.
Range. Displays the range within each cell for the selected variable.
Variance. Displays the variance within each cell for the selected variable.


Two-Way Tables Dialog Box 


To open the Two-Way Tables dialog box, from the menus choose: 


Analyze 
Tables 
Two-Way... 


[Dialog box: Analyze: Tables: Two-Way, with tabs Main, Measures, Cell Statistics, and
Resampling. The Main tab has the Available variable(s) list, the Row variable(s) list,
and the required Column variable box, together with the Tables, Options, and Save
settings described below.]


The Two-Way tables crosstabulate one or more categorical row variables with a
categorical column variable.


■ Row variable(s). The variables displayed in the rows of the crosstabulation. Each
row variable is crosstabulated with the column variable.

■ Column variable. The variable displayed in the columns of the crosstabulation.
The column variable is crosstabulated with each row variable.

■ Tables. Tables can include counts, percents (row, column, or total), expected
counts, deviates (Observed - Expected), standardized deviates ((Observed -
Expected) / SQR(Expected)), and counts and percents.

■ Options. You can include counts and percentages for cases with missing data. In
addition, you can display output in a listing format with a tabular display. The
listing includes counts, cumulative counts, percentages, and cumulative
percentages for each combination of row and column variable categories. You can
also shade table values (counts, row percents, column percents, percents, expected
counts, and standardized deviates) based on a threshold value. The default
threshold value is 4.

■ Save table(s). You can save either the crosstabulation or the measures related to it
to a data file. In the case of crosstabulation, SYSTAT saves, for each cell of the
table, a record with the cell frequency and the row and column category values. In
the case of measures, rows are created for all requested statistics, and columns are
created for all requested two-way tables.
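
The expected counts, deviates, and standardized deviates listed under Tables follow
directly from the row and column totals. A minimal Python sketch, with a hypothetical
function name and hypothetical counts, and assuming every row and column total is
positive:

```python
import math


def expected_and_deviates(table):
    """Return (expected, deviates, standardized) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Deviate: Observed - Expected.
    deviates = [[obs - exp for obs, exp in zip(obs_row, exp_row)]
                for obs_row, exp_row in zip(table, expected)]
    # Standardized deviate: (Observed - Expected) / SQR(Expected).
    standardized = [[dev / math.sqrt(exp) for dev, exp in zip(dev_row, exp_row)]
                    for dev_row, exp_row in zip(deviates, expected)]
    return expected, deviates, standardized


# Hypothetical 2 x 2 counts.
expected, deviates, standardized = expected_and_deviates([[10, 20],
                                                          [30, 40]])
```

The squared standardized deviates add up to the Pearson chi-square, so a large value
points to a cell that contributes heavily to the test statistic.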


Two-Way Tables Measures 


A wide variety of measures is available for testing the association between variables in 
a crosstabulation. Each measure is appropriate for a particular table structure (rows by 
columns), and a few assume that categories are ordered (ordinal data). 

[Dialog box: Measures tab of Analyze: Tables: Two-Way, listing the measures described
below, grouped by table structure.]


Pearson chi-square. For tables with any number of rows and columns, tests for
independence of the row and column variables.


Likelihood ratio chi-square. An alternative to the Pearson chi-square, primarily used 
as a test statistic for loglinear models. 


2 x 2 tables. For tables with two rows and two columns, the available tests are:


Yates' corrected chi-square. Adjusts the Pearson chi-square statistic for small
samples.


Fisher's exact test. Counts all possible outcomes exactly. When the expected cell
sizes are small (less than 5), use this test as an alternative to the Pearson chi-square.


Odds ratio. A measure of association in which a value near 1 indicates no relation
between the variables.


Yule's Q and Y. Measures of association in which values near -1 or +1 indicate a
strong relation. Values near 0 indicate no relation. Yule's Y is less sensitive to
differences in the margins of the table than Q.


2 x k tables. For tables with only two rows and any number of ordered column 
categories (or vice versa), Cochran's test of linear trend is available to reveal whether 
proportions increase (or decrease) linearly across the ordered categories. 


r x r tables. For square tables, the available tests include: 


McNemar’s test for symmetry. Used for paired (or matched) variables. Tests 
whether the counts above the table diagonal differ from those below the diagonal. 
Small probability values indicate a greater change in one direction. 


Cohen's kappa. Commonly used to measure agreement between two judges rating
the same objects. Tests whether the diagonal counts are larger than expected.
Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
between 0.40 and 0.75 indicate fair to good agreement, and values below 0.40
indicate poor agreement.


r x c tables, unordered levels. For tables with any number of rows or columns with no 
assumed category order, the available tests are: 


Phi. A chi-square based measure of association. Values may exceed 1. 


Cramer's V. A measure of association based on the chi-square. The value ranges 
between 0 and 1, with 0 indicating independence between the row and column 
variables and values close to 1 indicating dependence between the variables. 


Contingency coefficient. A measure of association based on the chi-square. Similar 
to Cramer's V, but values of 1 cannot be attained. 


Goodman-Kruskal's lambda and Uncertainty coefficient. These are measures of 
association that indicate the proportional reduction in error when values of one 
variable are used to predict values of the other variable. For column dependent 
measures, values near 0 indicate that the row variable is no help in predicting the 
column variable. SYSTAT also gives row dependent and symmetric measures. 


r x c tables, ordered levels. For tables with any number of rows or columns in which 
categories for both variables represent ordered levels (for example, low, medium, 
high), the available measures are: 


Spearman's rho. Similar to the Pearson correlation coefficient, but uses the ranks 
of the data rather than the actual values. 


Goodman-Kruskal's gamma, Kendall's tau-b, and Stuart's tau-c. Measures of
association between two ordinal variables that range between -1 and +1, differing
only in the method of dealing with ties. Values close to 0 indicate little or no
relationship.


Somers' d. An asymmetric measure of association between two ordinal variables
that ranges from -1 to +1. Values close to -1 or +1 indicate a strong relationship
between the variables. Both column and row dependent measures are displayed.


Confidence for measures. You can specify a confidence level for confidence intervals
of the following: Odds ratio, Yule's Q and Y, Uncertainty, Goodman-Kruskal's lambda,
Cohen's kappa, Spearman's rho, Goodman-Kruskal's gamma, Kendall's tau-b, Stuart's
tau-c, and Somers' d. The default value is 0.95.


Two-Way Tables Cell Statistics 


In addition to frequencies and percents, tables can also include various descriptive 
statistics for any given variable in your data. 

[Dialog box: Cell Statistics tab of Analyze: Tables: Two-Way, with check boxes for
Minimum, Maximum, Sum, Mean, SD, Range, and Variance.]

One or more of the following statistics for a chosen variable can be displayed for each
cell in the table:


Minimum. Displays the minimum value within each cell for the selected variable.
Maximum. Displays the maximum value within each cell for the selected variable.
Sum. Displays the sum within each cell for the selected variable.
Mean. Displays the cell means for the selected variable.
SD. Displays the standard deviation within each cell for the selected variable.
Range. Displays the range within each cell for the selected variable.
Variance. Displays the variance within each cell for the selected variable.


Multiway Tables: Tabulate Dialog Box 


Multiway tables: Tabulate provides frequency counts and percentages for a series of 
two-way tables stratified by all combinations of values of a third, fourth, etc., table 
factor. 


To open the Multiway Tables: Tabulate dialog box, from the menus choose: 


Analyze 
Tables 
Multiway 
Tabulate... 


[Dialog box: Tables: Multiway: Tabulate, with tabs Main, Cell Statistics, and Resampling.
The Main tab has the Available variable(s) list, the required Row variable and Column
variable boxes, a Strata variable(s) list with the Separate and Crossed settings, and the
Tables, Options, Mantel-Haenszel test, and Save table(s) settings described below.]


Row variable. The variable displayed in the rows of the crosstabulation. 


Column variable. The variable displayed in the columns of the crosstabulation. 


Strata variable(s). If strata are separate, a separate crosstabulation is produced for each
value of each strata variable. If strata are crossed, a separate crosstabulation is
produced for each unique combination of strata variable values. For example, if you
have two strata variables, each with five categories, Separate will produce 10 tables
and Crossed will produce 25 tables.


Tables. You can display counts, percents, row percents, column percents and counts 
and percents. 


Mantel-Haenszel test for 2 x 2 sub-tables. You can use the Mantel-Haenszel test for
sub-tables to test for an association between two variables while controlling for
another variable.


Options. You can include counts and percents for cases with missing data. In addition, 
you can display output in a listing format, including percentages and cumulative 
percentages, with a tabular display. 


Save table(s). Saves the crosstabulation(s) to a data file. For each cell of the table, 
SYSTAT saves a record with the cell frequency and the row, column and strata 
category values. 
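
The difference between the Separate and Crossed strata settings described above is
easy to see by enumerating the tables each would produce. The Python sketch below is
illustrative only: the function name, the strata variable names, and their category
codes are hypothetical.

```python
from itertools import product


def strata_tables(strata_levels, crossed):
    """List the strata definitions for which one crosstabulation is produced."""
    if crossed:
        # One table per unique combination of strata values.
        names = list(strata_levels)
        return [dict(zip(names, combo))
                for combo in product(*strata_levels.values())]
    # One table per value of each strata variable, taken one variable at a time.
    return [{name: level}
            for name, levels in strata_levels.items()
            for level in levels]


# Two hypothetical strata variables with five categories each.
strata = {"REGION": [1, 2, 3, 4, 5], "AGEGROUP": [1, 2, 3, 4, 5]}
separate_tables = strata_tables(strata, crossed=False)
crossed_tables = strata_tables(strata, crossed=True)
```

With two five-category strata variables, the lists have 10 and 25 entries, matching the
counts given for Separate and Crossed.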


Multiway Tables: Tabulate Cell Statistics 


In addition to frequencies and percentages, tables can also include various descriptive 
statistics for any given variable in your data. 

[Dialog box: Cell Statistics tab of Tables: Multiway: Tabulate, with check boxes for the
statistics listed below.]


One or more of the following statistics for a chosen variable can be displayed for each 
cell in the table: 


Minimum. Displays the minimum value within each cell for the selected variable.
Maximum. Displays the maximum value within each cell for the selected variable.
Sum. Displays the sum within each cell for the selected variable.
Mean. Displays the cell means for the selected variable.
SD. Displays the standard deviation within each cell for the selected variable.
Range. Displays the range within each cell for the selected variable.
Variance. Displays the variance within each cell for the selected variable.


Multiway Tables: Standardize Dialog Box 


Multiway tables: Standardize provides standardized frequency counts for a series of
two-way tables stratified by all combinations of values of a third, fourth, etc., table
factor. Test factor standardization is an application of standardization: it
statistically removes the effect of control variables so that the relationship between
independent and dependent variables can be examined without these control variables.


To open the Multiway Tables: Standardize dialog box, from the menus choose: 


Analyze 
Tables 
Multiway 
Standardize... 


[Dialog box: Tables: Multiway: Standardize, with tabs Main, Partial Measures, and
Resampling. The Main tab has the Available variable(s) list, the required Row variable,
Column variable, and Strata variable(s) boxes, and the Tables, Shade values, and
Save table(s) settings described below.]


Row variable. The variable displayed in the rows of the crosstabulation. 


Column variable. The variable displayed in the columns of the crosstabulation. 


Strata variable(s). If strata are separate, a separate crosstabulation is produced for each 
value of each strata variable. If strata are crossed, a separate crosstabulation is 
produced for each unique combination of strata variable values. For example, if you 
have two strata variables, each with five categories, Separate will produce 10 tables 
and Crossed will produce 25 tables. 


Tables. Tables can include counts, percents (row, column, or total), expected counts,
deviates (Observed - Expected), standardized deviates ((Observed - Expected) /
SQR(Expected)), and counts and percents.


Shade values. Table values such as counts, row percents, column percents, percents,
expected counts, and standardized deviates can be shaded using the threshold value.
The default threshold value is 4. Shades of red indicate a negative standardized
residual value, whereas shades of blue indicate a positive standardized residual value.


Save table(s). You can save either the crosstabulation or the partial measures related to 
it, to a data file. In the case of crosstabulation, SYSTAT saves, for each cell of the table, 
a record with the cell frequency and the row and column category values. In the case 
of measures, rows are created for all requested measures, and a column is created for 
the standardized two-way table. 


Multiway Tables: Standardize - Partial Measures


[Dialog box: Partial Measures tab of Tables: Multiway: Standardize, listing the measures
described below, grouped by table structure, and the Confidence for measures setting.]


Pearson chi-square. For tables with any number of rows and columns, tests for the 
independence of the row and column variables. 


Likelihood ratio chi-square. An alternative to the Pearson chi-square, primarily used 
as a test statistic for log-linear models. 


2 x 2 tables. For tables with two rows and two columns, the available tests are: 


■ Yates' corrected chi-square. Adjusts the Pearson chi-square statistic for small
samples.


■ Fisher's exact test. Counts all possible outcomes exactly. When the expected cell
sizes are small (less than 5), use this test as an alternative to the Pearson chi-square.


■ Odds ratio. A measure of association in which a value near 1 indicates no relation
between the variables.


■ Yule's Q and Y. Measures of association in which values near -1 or +1 indicate a
strong relation. Values near 0 indicate no relation. Yule's Y is less sensitive to
differences in the margins of the table than Q.


2 x k tables. For tables with only two rows and any number of ordered column 
categories (or vice versa), Cochran's test of linear trend is available to reveal whether 
proportions increase (or decrease) linearly across the ordered categories. 


r x r tables. For square tables, the available tests include: 


McNemar's test for symmetry. Used for paired (or matched) variables. Tests 
whether the counts above the table diagonal differ from those below the diagonal. 
Small probability values indicate a greater change in one direction. 


Cohen's kappa. Commonly used to measure agreement between two judges rating
the same objects. Tests whether the diagonal counts are larger than expected.
Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
between 0.40 and 0.75 indicate fair to good agreement, and values below 0.40
indicate poor agreement.


r x c tables, unordered levels. For tables with any number of rows or columns with no 
assumed category order, the available tests are: 


Phi. A chi-square based measure of association. Values may exceed 1. 


Cramer's V. A measure of association based on the chi-square. The value ranges 
between 0 and 1, with 0 indicating independence between the row and column 
variables and values close to 1 indicating dependence between the variables. 


Contingency coefficient. A measure of association based on the chi-square. Similar 
to Cramer's V, but values of 1 cannot be attained. 


Goodman-Kruskal's lambda and Uncertainty coefficient. These are measures of 
association that indicate the proportional reduction in error when values of one 
variable are used to predict values of the other variable. For column dependent 
measures, values near 0 indicate that the row variable is of no help in predicting the 
column variable. SYSTAT also gives row dependent and symmetric measures. 


r x c tables, ordered levels. For tables with any number of rows or columns in which 
categories for both variables represent ordered levels (for example, low, medium, 
high), the available tests are: 


Spearman's rho. Similar to the Pearson correlation coefficient, but uses the ranks 
of the data rather than the actual values. 


Goodman-Kruskal's gamma, Kendall's tau-b, and Stuart's tau-c. Measures of
association between two ordinal variables that range between -1 and +1, differing
only in the method of dealing with ties. Values close to 0 indicate little or no
relationship.


Somers' d. An asymmetric measure of association between two ordinal variables
that ranges from -1 to +1. Values close to -1 or +1 indicate a strong relationship
between the variables. Both column and row dependent measures are displayed.


Confidence for measures. You can specify a confidence level for confidence intervals
of the following: Odds ratio, Yule's Q and Y, Uncertainty, Goodman-Kruskal's lambda,
Cohen's kappa, Spearman's rho, Goodman-Kruskal's gamma, Kendall's tau-b, Stuart's
tau-c, and Somers' d. The default value is 0.95.


Using Commands 


For one-way tables in XTAB, specify: 


XTAB
USE filename
SAVE filename
PLENGTH NONE / FREQ CHISQ LIST PERCENT TCP
or
PLENGTH SHORT
or
PLENGTH MEDIUM
or
PLENGTH LONG
TABULATE varlist / CONFI=n MISS,
MINIMUM=varname MAXIMUM=varname,
SUM=varname,
MEAN=varname SD=varname,
RANGE=varname VARIANCE=varname,
SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


Short options can be
LIST CHISQ
Medium options can be
LIST CHISQ FREQ PERCENT
Long options can be
LIST CHISQ FREQ PERCENT TCP


For two-way tables in XTAB, specify:


XTAB
USE filename
SAVE filename / TABLES or MEASURES
PLENGTH NONE / FREQ CHISQ LRCHI YATES FISHER ODDS YULE,
COCHRAN MCNEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
TAUB TAUC SOMERS EXPECT DEVI STAND LIST PERCENT ROWPCT,
COLPCT TCP
or
PLENGTH SHORT
or
PLENGTH MEDIUM
or
PLENGTH LONG
TABULATE rowvar * colvar / CONFI=n MISS,
SHADE=threshold,
MINIMUM=varname MAXIMUM=varname,
SUM=varname MEAN=varname,
SD=varname RANGE=varname,
VARIANCE=varname,
SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


Short options can be 
FREQ PERCENT CHISQ 
Medium options can be 
FREQ PERCENT CHISQ LIST STAND MEASURES 
Long options can be 
FREQ PERCENT CHISQ LIST STAND MEASURES 
TCP ROWPCT COLPCT EXPECT DEVI 


For multiway tables: Tabulate in XTAB, specify: 


XTAB
USE filename
SAVE filename
PLENGTH NONE / FREQ MANTEL LIST PERCENT ROWPCT COLPCT TCP
or
PLENGTH SHORT
or
PLENGTH MEDIUM
or
PLENGTH LONG
TABULATE varlist * rowvar * colvar / MISS,
MINIMUM=varname,
MAXIMUM=varname,
SUM=varname,
MEAN=varname,
SD=varname,
RANGE=varname,
VARIANCE=varname,
SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


Short options can be
FREQ PERCENT
Medium options can be
FREQ PERCENT LIST MANTEL
Long options can be
FREQ PERCENT LIST MANTEL TCP ROWPCT COLPCT


MEASURES can be
LRCHI YATES FISHER ODDS YULE COCHRAN,
MCNEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
TAUB TAUC SOMERS


For multiway tables: Standardize in XTAB, specify: 


XTAB
USE filename
SAVE filename / TABLES or MEASURES
PLENGTH NONE / FREQ CHISQ LRCHI YATES FISHER,
ODDS YULE COCHRAN,
MCNEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
TAUB TAUC SOMERS EXPECT DEVI STAND PERCENT,
ROWPCT COLPCT
STD varlist * rowvar * colvar / CONFI=n SHADE=threshold,
SAMPLE=BOOT(m,n) or SIMPLE(m,n) or JACK


Usage Considerations 


Types of data. There are two ways to organize data for tables:

■ The usual cases-by-variables rectangular data file
■ Cell counts with cell identifiers


For example, you may want to analyze the following table reflecting application results
by gender for business schools:


          Admitted   Denied
Male           420       90
Female         150       25


A cases-by-variables data file has the following form:


PERSON   GENDER$   STATUS$
     1   female    admit
     2   male      deny
     3   male      admit
(etc.)
   684   female    deny
   685   male      admit


Instead of entering one case for each of the 685 applicants, you could use the second 
method to enter four cases: 


GENDER$   STATUS$   COUNT
male      admit       420
male      deny         90
female    admit       150
female    deny         25


For this method, the cell counts in the third column are identified by designating
COUNT as a FREQUENCY variable.
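
The two layouts carry the same information, which a short Python sketch can confirm by
expanding the four count records into one record per applicant and tabulating them
again. The function name and the variable names are hypothetical; the counts are those
in the table above.

```python
from collections import Counter

# Cell counts with cell identifiers: (GENDER$, STATUS$, COUNT).
cell_counts = [
    ("male", "admit", 420),
    ("male", "deny", 90),
    ("female", "admit", 150),
    ("female", "deny", 25),
]


def expand_to_cases(cells):
    """Duplicate each cell's identifiers COUNT times, one record per case."""
    return [(gender, status)
            for gender, status, count in cells
            for _ in range(count)]


cases = expand_to_cases(cell_counts)   # cases-by-variables layout
crosstab = Counter(cases)              # tabulating the cases recovers the counts
```

Designating COUNT as a FREQUENCY variable has the same effect as this expansion without
requiring 685 records to be entered.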


Print options. Three levels of output are available. The statistics produced depend on
the dimensionality of the table. PLENGTH SHORT yields the output in the list layout
for a one-way table, frequency and percentage tables for two-way and multiway tables,
and the Pearson chi-square for one-way and two-way tables. PLENGTH MEDIUM yields
all short output, plus all measures appropriate for the dimensionality of two-way and
multiway tables, frequency and percentage tables for one-way tables, standardized
deviates for two-way tables, and the list layout for two-way and multiway tables.
PLENGTH LONG yields all medium output, plus the Table of Counts and Percents
(TCP) for all tables, row percents and column percents for two-way and multiway
tables, and expected values and deviate values for two-way tables.


Quick Graphs. Frequency tables do not produce any Quick Graphs. 


Saving files. You can save the frequency counts and any other requested cell values to 
a file. For two-way tables, you can also save one or more Measures. 


BY groups. Use of a BY variable yields separate frequency tables (and corresponding 
statistics) for each level of the BY variable. 


Case frequencies. XTAB uses the FREQUENCY variable to duplicate cases. This is the
preferred method of input when the data are aggregated.


Case weights. WEIGHT is available for frequency tables. 


Examples 


Example 1 
One-Way Tables 


This example uses questionnaire data from a community survey (Afifi et al., 2003). 
The SURVEY2 data file includes a record (case) for each of the 256 subjects in the 
sample. We request frequencies for gender, marital status, and religion. The values of 
these variables are numbers, so we add character identifiers for the categories. 


The input is: 


USE SURVEY2
XTAB
LABEL SEX / 1='Male', 2='Female'
LABEL MARITAL / 1='Never', 2='Married', 3='Divorced',
4='Separated'
LABEL RELIGION / 1='Protestant', 2='Catholic', 3='Jewish',
4='None', 6='Other'
PLENGTH NONE / FREQ
TABULATE SEX
TABULATE MARITAL
TABULATE RELIGION


If the words male and female were stored in the variable SEX$, you would omit LABEL
and tabulate SEX$ directly. If you omit LABEL and specify SEX, the numbers would
label the output.


■ When using the Label dialog box, you can omit quotation marks around category
names. With commands, you can omit them if the name has no embedded blanks
or symbols (the name, however, is displayed in uppercase letters).


The output is:


Counts
Values for SEX

    Male   Female    Total
     104      152      256


Counts
Values for MARITAL

   Never  Married  Divorced  Separated    Total
      73      127        43         13      256


Counts
Values for RELIGION

  Protestant  Catholic  Jewish    None   Other    Total
         133        46      23      52       2      256


In this sample of 256 subjects, 152 are females, 127 are married, and 133 are 
Protestants. 


Frequency Distribution 


List layout produces an alternative layout for the same information. Percentages and 
cumulative percentages are part of the display. 


The input is: 


USE SURVEY2 
XTAB 
LABEL SEX / 1='Male', 2='Female' 
LABEL MARITAL / 1='Never', 2='Married', 3='Divorced', 
4='Separated' 
LABEL RELIGION / 1='Protestant', 2='Catholic', 3='Jewish', 
4='None', 6='Other' 
PLENGTH NONE / LIST 
TABULATE SEX MARITAL RELIGION 
PLENGTH 


The output is:


Frequency Distribution for SEX

SEX        | Frequency  Cumulative Frequency  Percent  Cumulative Percent
-----------+-------------------------------------------------------------
Male       |       104                   104   40.625              40.625
Female     |       152                   256   59.375             100.000


Frequency Distribution for MARITAL

MARITAL    | Frequency  Cumulative Frequency  Percent  Cumulative Percent
-----------+-------------------------------------------------------------
Never      |        73                    73   28.516              28.516
Married    |       127                   200   49.609              78.125
Divorced   |        43                   243   16.797              94.922
Separated  |        13                   256    5.078             100.000


Frequency Distribution for RELIGION

RELIGION   | Frequency  Cumulative Frequency  Percent  Cumulative Percent
-----------+-------------------------------------------------------------
Protestant |       133                   133   51.953              51.953
Catholic   |        46                   179   17.969              69.922
Jewish     |        23                   202    8.984              78.906
None       |        52                   254   20.313              99.219
Other      |         2                   256    0.781             100.000

Almost 60% (59.4) of the subjects are female, approximately 50% (49.6) are married, 
and more than half (52%) are Protestants. 
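
The percentages and cumulative percentages in the list layout follow directly from the counts. The following Python sketch is not SYSTAT code; the helper name `frequency_distribution` is hypothetical and is used here only to show the arithmetic behind the MARITAL panel above.

```python
def frequency_distribution(counts):
    """Return rows of (label, frequency, cumulative frequency,
    percent, cumulative percent) for a one-way table of counts."""
    total = sum(counts.values())
    rows, running = [], 0
    for label, freq in counts.items():
        running += freq
        rows.append((label, freq, running,
                     round(100 * freq / total, 3),
                     round(100 * running / total, 3)))
    return rows


# Counts for MARITAL taken from the output above.
marital = {"Never": 73, "Married": 127, "Divorced": 43, "Separated": 13}
for row in frequency_distribution(marital):
    print("%-10s %5d %5d %8.3f %8.3f" % row)
```

Each cumulative percent is the cumulative frequency divided by the total, not a running sum of the rounded percents.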
Example 2 


Confidence Intervals for One-Way Table Percentages 


If your data are binomially or multinomially distributed, you may want confidence 
intervals on the cell proportions. SYSTAT’s confidence intervals are based on an 
approximation by Bailey (1980). XTAB uses that reference’s approximation number 6 
with a continuity correction, which closely fits the real intervals for the binomial on 
even small samples and performs well when population proportions are near 0 or 1. 
The confidence intervals are scaled on a percentage scale for compatibility with the 
other XTAB output. 

Here is an example using data from Davis (1977) on the number of buses failing
after driving a given distance (1 of 10 distances). Print the percentages of the 191 buses
failing in each distance category to see the coverage of the intervals.

The input is:


USE BUSES 

XTAB 
FREQUENCY COUNT 
PLENGTH NONE / FREQ PERCENT 
TABULATE DISTANCE / CONFI=.95 


The output is:

Counts
Values for DISTANCE

      1      2      3      4      5      6      7      8      9     10   Total
      6     11     16     25     34     46     33     16      2      2     191

Percents of Total Count
Values for DISTANCE

        1        2        3        4        5        6        7
    3.141    5.759    8.377   13.089   17.801   24.084   17.277

        8        9       10    Total        N
    8.377    1.047    1.047  100.000  191.000

95 Percent Approximate Confidence Intervals Scaled as Cell Percents
Values for DISTANCE

                    1        2        3        4        5        6        7
LOWER LIMIT     0.548    1.903    3.552    6.905   10.560   15.737   10.142
UPPER LIMIT     8.234   11.875   15.259   20.996   26.447   33.420   25.852

                    8        9       10
LOWER LIMIT     3.552    0.000    0.000
UPPER LIMIT    15.259    4.914    4.914

There are 6 buses in the first distance category; this is 3.14% of the 191 buses. The 
confidence interval for this percentage ranges from 0.55 to 8.23%. 


Frequency Distribution 


The input is: 


PLENGTH NONE / PERCENT LIST 
TABULATE DISTANCE / CONFI=0.95

The output is:

Frequency Distribution for DISTANCE

DISTANCE | Frequency  Cumulative  Percent  Cumulative
         |            Frequency            Percent
---------+---------------------------------------------
       1 |         6           6    3.141       3.141
       2 |        11          17    5.759       8.901
       3 |        16          33    8.377      17.277
       4 |        25          58   13.089      30.366
       5 |        34          92   17.801      48.168
       6 |        46         138   24.084      72.251
       7 |        33         171   17.277      89.529
       8 |        16         187    8.377      97.906
       9 |         2         189    1.047      98.953
      10 |         2         191    1.047     100.000

Percents of Total Count
Values for DISTANCE

        1        2        3        4        5        6        7
    3.141    5.759    8.377   13.089   17.801   24.084   17.277

        8        9       10    Total        N
    8.377    1.047    1.047  100.000  191.000

95 Percent Approximate Confidence Intervals Scaled as Cell Percents
Values for DISTANCE

                    1        2        3        4        5        6        7
LOWER LIMIT     0.548    1.903    3.552    6.905   10.560   15.737   10.142
UPPER LIMIT     8.234   11.875   15.259   20.996   26.447   33.420   25.852

                    8        9       10
LOWER LIMIT     3.552    0.000    0.000
UPPER LIMIT    15.259    4.914    4.914

Example 3
Two-Way Tables


This example uses the SURVEY2 data to crosstabulate marital status against religion.
The input is:


USE SURVEY2 
XTAB 
LABEL MARITAL / 1='Never', 2='Married', 3='Divorced', 
4='Separated' 
LABEL RELIGION / 1='Protestant', 2='Catholic', 3='Jewish', 
                 4='None', 6='Other'
PLENGTH NONE / FREQ STAND
TABULATE MARITAL * RELIGION / SHADE=2


The output is: 


Counts 
MARITAL (rows) by RELIGION(columns) 


In the sample of 256 people, 73 never married. Of the people who have never married,
29 are Protestants (the cell in the upper left corner), and none are in the Other category
(their religion is not among the first four categories). The Totals (or marginals) along
the bottom row and down the far right column are the same as the values displayed for
one-way tables.


The output is:

Standardized Deviates: (Observed-Expected)/SQR(Expected)
MARITAL (rows) by RELIGION (columns)


Observe the shading of values in the crosstabulation of religion vs. marital status.
Protestants, Other, and None deviate from the model in which religion and marital
status are independent (red and blue shades), whereas Catholic and Jewish are
consistent with independence (grey shade).


Omitting Sparse Categories 


There are only two counts in the last column, and the counts in the last row are fairly 
sparse. It is easy to omit rows and/or columns. You can: 


■ Omit the category codes from the LABEL request.
■ Select cases to use.


Note that LABEL and SELECT remain in effect until you turn them off. If you request 
several different tables, use SELECT to ensure that the same cases are used in all tables. 
The subset of cases selected via LABEL applies only to those tables that use the 
variables specified with LABEL. To turn off the LABEL specification for RELIGION, 
for example, specify: 


LABEL RELIGION 


We continue from the last table, eliminating the last category codes for MARITAL and 
RELIGION: 


SELECT MARITAL <> 4 AND RELIGION <> 6 
PLENGTH NONE/FREQ 

TABULATE MARITAL * RELIGION 

SELECT 


The output is:

Data for the following results were selected according to
SELECT MARITAL <> 4 AND RELIGION <> 6

Counts
MARITAL (rows) by RELIGION (columns)

          |    1     2     3     4   Total
----------+-------------------------------
Never     |   29    16     8    20      73
Married   |   75    21    11    19     126
Divorced  |   21     6     3    13      43
----------+-------------------------------
Total     |  125    43    22    52     242


List Layout


The following is the panel for marital status crossed with religious preference.


The input is:


USE SURVEY2
XTAB
RECODE MARITAL1$ = MARITAL / 1='Never', 2='Married',
                             3='Divorced'
RECODE RELIGION1$ = RELIGION / 1='Protestant', 2='Catholic',
                               3='Jewish'
PLENGTH NONE / LIST
TABULATE MARITAL1$ * RELIGION1$
PLENGTH


The output is:


Frequency Distribution for MARITAL1$ (rows) by RELIGION1$ (columns)

MARITAL1$  RELIGION1$ | Frequency  Cumulative  Percent  Cumulative
                      |            Frequency            Percent
----------------------+---------------------------------------------
Divorced   Catholic   |
Divorced   Jewish     |
Divorced   Protestant |
Married    Catholic   |
Married    Jewish     |
Married    Protestant |
Never      Catholic   |
Never      Jewish     |
Never      Protestant |

Example 4
Frequency Input


XTAB, like other SYSTAT procedures, reads cases-by-variables data from a SYSTAT 
file. However, if you want to analyze a table from a report or a journal article, you can 
enter the cell counts directly. This example uses counts from a four-way table of a 
breast cancer study of 764 women. The data are from Morrison et al. (1973), cited in 
Bishop, Fienberg, and Holland (1975). There is one record for each of the 72 cells in
the table, with the count (NUMBER) of women in the cell and codes or category names
to identify their age group (under 50, 50 to 69, and 70 or over), treatment center
(Tokyo, Boston, or Glamorgan), survival status (dead or alive), and tumor diagnosis
(minimal inflammation and benign, maximum inflammation and benign, minimal
inflammation and malignant, and maximum inflammation and malignant). This
example illustrates how to form a two-way table of AGE by CENTER$.


The input is: 


USE CANCER 
XTAB 
FREQUENCY NUMBER 
LABEL AGE / 50='Under 50', 60='50 to 69', 70='70 & Over'
PLENGTH NONE / FREQ CHISQ
TABULATE CENTER$ * AGE


The output is: 
Case Frequencies Determined by Value of Variable NUMBER

Counts
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over   Total
----------+---------------------------------------
Boston    |       58       122         73     253
Glamorgn  |       71       109         41     221
Tokyo     |      151       120         19     290
----------+---------------------------------------
Total     |      280       351        133     764

Chi-square Tests of Association for CENTER$ and AGE

Test Statistic      |  Value     df  p-value
--------------------+------------------------
Pearson Chi-square  | 74.039  4.000    0.000

Number of Valid Cases: 72


Of the 764 women studied, 290 were treated in Tokyo. Of these women, 151 were in 
the youngest age group, and 19 were in the 70 or over age group. 
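
To see how the Pearson chi-square in this output arises from the cell counts, here is a minimal Python sketch (not SYSTAT code). The function name `pearson_chi_square` is hypothetical; the formula is the usual sum of (Observed - Expected)^2 / Expected, with expected counts computed from the row and column totals.

```python
def pearson_chi_square(table):
    """Return (chi-square, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two factors were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Rows: Boston, Glamorgan, Tokyo; columns: Under 50, 50 to 69, 70 & Over.
centers_by_age = [[58, 122, 73], [71, 109, 41], [151, 120, 19]]
chi2, df = pearson_chi_square(centers_by_age)
print(round(chi2, 3), df)  # 74.039 4
```

The statistic is based on the 764 women, that is, on the sum of the NUMBER counts, even though the file holds only 72 records.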

Example 5
Missing Category Codes


You can choose whether or not to include a separate category for missing codes. For 
example, if some subjects did not check “male” or “female” on a form, there would be 
three categories for SEX$: male, female, and blank (missing). By default, when values 
of a table factor are missing, SYSTAT does not include a category for missing values. 

In the OURWORLD data file, some countries did not report the GNP to the United 
Nations. In this example, we include a category for missing values, and we follow this 
request with a table that omits the category for missing. 


The input is: 


USE OURWORLD 
XTAB 
PLENGTH NONE/FREQ 
TABULATE GROUP$ * GNP$ / MISS
LABEL GNP$ / 'D'='Developed', 'U'='Emerging'
TABULATE GROUP$ * GNP$


The output is: 
Counts
GROUP$ (rows) by GNP$ (columns)

          | !MISSING!     D     U   Total
----------+------------------------------
Europe    |         3    17     0      20
Islamic   |         2     4    10      16
NewWorld  |         1    15     5      21
----------+------------------------------
Total     |         6    36    15      57

Counts
GROUP$ (rows) by GNP$ (columns)

          | Developed  Emerging   Total
----------+----------------------------
Europe    |        17         0      17
Islamic   |         4        10      14
NewWorld  |        15         5      20
----------+----------------------------
Total     |        36        15      51




List Layout 


To create a listing of the counts in each cell of the table: 
PLENGTH NONE / LIST
TABULATE GROUP$ * GNP$
PLENGTH

The output is:


Frequency Distribution for GROUP$ (rows) by GNP$ (columns)

GROUP$    GNP$      | Frequency  Cumulative  Percent  Cumulative
                    |            Frequency            Percent
--------------------+---------------------------------------------
Europe    Developed |        17          17   33.333      33.333
Islamic   Developed |         4          21    7.843      41.176
Islamic   Emerging  |        10          31   19.608      60.784
NewWorld  Developed |        15          46   29.412      90.196
NewWorld  Emerging  |         5          51    9.804     100.000


Note that there is no entry for the empty cell. 


Example 6 
Percentages 


Percentages are helpful for describing categorical variables and interpreting relations 
between table factors. XTAB prints tables of percentages in the same layout as 
described for frequency counts. That is, each frequency count is replaced by the 
percentage. Percentages are computed by dividing each cell frequency by: 

■ The total frequency in its row (row percents)
■ The total frequency in its column (column percents)
■ The total table frequency or sample size (table percents)


In this example, we request all three percentages using the following input:


USE OURWORLD 

XTAB 
LABEL GNP$ / 'D'='Developed', 'U'='Emerging'
PLENGTH NONE / ROWP COLP PERCENT 
TABULATE GROUP$ * GNP$ 


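The output is:

Percents of Total Count
GROUP$ (rows) by GNP$ (columns)

          | Developed  Emerging |   Total        N
----------+---------------------+------------------
Europe    |    33.333     0.000 |  33.333   17.000
Islamic   |     7.843    19.608 |  27.451   14.000
NewWorld  |    29.412     9.804 |  39.216   20.000
----------+---------------------+------------------
Total     |    70.588    29.412 | 100.000
N         |    36.000    15.000 |  51.000

Row Percents
GROUP$ (rows) by GNP$ (columns)

          | Developed  Emerging |   Total        N
----------+---------------------+------------------
Europe    |   100.000     0.000 | 100.000   17.000
Islamic   |    28.571    71.429 | 100.000   14.000
NewWorld  |    75.000    25.000 | 100.000   20.000
----------+---------------------+------------------
Total     |    70.588    29.412 | 100.000
N         |    36.000    15.000 |  51.000

Column Percents
GROUP$ (rows) by GNP$ (columns)

          | Developed  Emerging |   Total        N
----------+---------------------+------------------
Europe    |    47.222     0.000 |  33.333   17.000
Islamic   |    11.111    66.667 |  27.451   14.000
NewWorld  |    41.667    33.333 |  39.216   20.000
----------+---------------------+------------------
Total     |   100.000   100.000 | 100.000
N         |    36.000    15.000 |  51.000

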

Missing Categories 


Notice how the row percentages change when we include a category for the missing 
GNP. 


The input is: 


PLENGTH NONE / ROWP 

LABEL GNP$ / ' '='Missing', 'D'='Developed', 'U'='Emerging'
TABULATE GROUP$ * GNP$ / MISS

PLENGTH

The output is:

Row Percents
GROUP$ (rows) by GNP$ (columns)

          | !MISSING!  Developed  Emerging |   Total        N
----------+--------------------------------+------------------
Europe    |    15.000     85.000     0.000 | 100.000   20.000
Islamic   |    12.500     25.000    62.500 | 100.000   16.000
NewWorld  |     4.762     71.429    23.810 | 100.000   21.000
----------+--------------------------------+------------------
Total     |    10.526     63.158    26.316 | 100.000
N         |     6.000     36.000    15.000 |  57.000


Here we see that 62.5% of the Islamic nations are classified as emerging. However, 
from the earlier table of row percentages, it might be better to say that among the 
Islamic nations reporting the GNP, 71.43% are emerging. 
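
The two figures for the Islamic nations differ only in the row total used as the divisor. The short Python sketch below (not SYSTAT code; `row_percents` is a hypothetical helper) computes the row percents for that row with and without the missing category.

```python
def row_percents(row):
    """Percent of the row total in each cell, rounded to three decimals."""
    total = sum(row)
    return [round(100 * count / total, 3) for count in row]


# Islamic row of the GROUP$ by GNP$ table of counts.
islamic = {"missing": 2, "developed": 4, "emerging": 10}

with_missing = row_percents(list(islamic.values()))
reported_only = row_percents([islamic["developed"], islamic["emerging"]])
print(with_missing)   # [12.5, 25.0, 62.5]
print(reported_only)  # [28.571, 71.429]
```

With the missing category included, the divisor is 16 nations; among the 14 nations that reported the GNP, 10 are emerging.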


List Layout 


To create a listing of the percents, row percents and column percents in each cell of the 
table, the input is: 


PLENGTH NONE/ LIST FREQ 
TABULATE GROUP$ * GNP$ / MISS
PLENGTH

The output is:


Frequency Distribution for GROUP$ (rows) by GNP$ (columns)

GROUP$    GNP$      | Frequency  Cumulative  Percent  Cumulative
                    |            Frequency            Percent
--------------------+---------------------------------------------
Europe    !MISSING! |         3           3    5.263       5.263
Europe    Developed |        17          20   29.825      35.088
Islamic   !MISSING! |         2          22    3.509      38.596
Islamic   Developed |         4          26    7.018      45.614
Islamic   Emerging  |        10          36   17.544      63.158
NewWorld  !MISSING! |         1          37    1.754      64.912
NewWorld  Developed |        15          52   26.316      91.228
NewWorld  Emerging  |         5          57    8.772     100.000

Counts
GROUP$ (rows) by GNP$ (columns)

          | !MISSING!  Developed  Emerging   Total
----------+---------------------------------------
Europe    |         3         17         0      20
Islamic   |         2          4        10      16
NewWorld  |         1         15         5      21
----------+---------------------------------------
Total     |         6         36        15      57

Example 7
Two-Way Table Measures


From the SURVEY2 data, you study the relationship between marital status and age.
This is a general table—while the categories for AGE are ordered, those for MARITAL 
are not. The usual Pearson chi-square measure is used to test the association between 
the two factors. This measure is the default for XTAB. 

The data file is the usual cases-by-variables rectangular file with one record for each 
person. We split the continuous variable AGE into four categories and add names such 
as 30 to 45 for the output. There are too few separated people to tally, so here we 
eliminate them and reorder the categories of MARITAL that remain. To supplement the 
results, we request row percentages. The RECODE command is used to recode the values
of AGE and MARITAL to AGE1$ and MARITAL1$, respectively.


The input is: 
USE SURVEY2 


XTAB 
RECODE AGE1$ = AGE / .. 29='18 to 29', 30 .. 45='30 to 45', 
46 .. 60='46 to 60', 60 .. ='Over 60' 
RECODE MARITAL1$ = MARITAL / 2='Married', 3='Divorced', 
1='Never' 


PLENGTH NONE/ROWPCT FREQ CHISQ 
TABULATE AGE1$ * MARITAL1$ 


The output is:

Counts
AGE1$ (rows) by MARITAL1$ (columns)

          | Divorced  Married   Never   Total
----------+----------------------------------
18 to 29  |        5       17      53      75
30 to 45  |       21       48       9      78
46 to 60  |       12       39       8      59
Over 60   |        5       23       3      31
----------+----------------------------------
Total     |       43      127      73     243

Row Percents
AGE1$ (rows) by MARITAL1$ (columns)

          | Divorced     Married   Never |      Total        N
----------+------------------------------+---------------------
18 to 29  |    6.667      22.667  70.667 |    100.000   75.000
30 to 45  |   26.923      61.538  11.538 |    100.000   78.000
46 to 60  |   20.339      66.102  13.559 |    100.000   59.000
Over 60   |   16.129      74.194   9.677 |    100.000   31.000
----------+------------------------------+---------------------
Total     |   17.695      52.263  30.041 |    100.000
N         |   43.000  1.270E+002  73.000 | 2.430E+002

Chi-square Tests of Association for AGE1$ and MARITAL1$

Test Statistic      |  Value     df  p-value
--------------------+------------------------
Pearson Chi-square  | 87.761  6.000    0.000

Number of Valid Cases: 243


Even though the chi-square statistic is highly significant (87.761; p-value < 0.0005), in 
the Row percentages table, you see that 70.67% of the youngest age group fall into the 
never-married category. Many of these people may be too young to consider marriage. 


Eliminating a Stratum 


If you eliminate the subjects in the youngest group, is there an association between 
marital status and age? To address this question, the input is: 


SELECT AGE > 29


PLENGTH NONE/ CHISQ PHI CRAMER CONT ROWPCT FREQ 
TABULATE AGE1$ * MARITAL1$ 
SELECT 


The output is: 


Data for the following results were selected according to
SELECT AGE > 29

Counts
AGE1$ (rows) by MARITAL1$ (columns)

          | Divorced  Married   Never   Total
----------+----------------------------------
30 to 45  |       21       48       9      78
46 to 60  |       12       39       8      59
Over 60   |        5       23       3      31
----------+----------------------------------
Total     |       38      110      20     168

Row Percents
AGE1$ (rows) by MARITAL1$ (columns)

          | Divorced     Married   Never |      Total        N
----------+------------------------------+---------------------
30 to 45  |   26.923      61.538  11.538 |    100.000   78.000
46 to 60  |   20.339      66.102  13.559 |    100.000   59.000
Over 60   |   16.129      74.194   9.677 |    100.000   31.000
----------+------------------------------+---------------------
Total     |   22.619      65.476  11.905 |    100.000
N         |   38.000  1.100E+002  20.000 | 1.680E+002

Chi-square Tests of Association for AGE1$ and MARITAL1$

Test Statistic      |  Value     df  p-value
--------------------+------------------------
Pearson Chi-square  |  2.173  4.000    0.704

Measures of Association for AGE1$ and MARITAL1$

            |                 95% Confidence Interval
Coefficient |  Value    ASE      Lower      Upper        Z
------------+----------------------------------------------
Phi         |  0.114
Cramer's V  |  0.080
Contingency |  0.113

Number of Valid Cases: 168


The proportion of married people is larger within the Over 60 group than for the 30 to
45 group: 74.19% of the former are married while 61.54% of the latter are married.
The youngest stratum has the most divorced people. However, you cannot say these
proportions differ significantly (chi-square = 2.173, p-value = 0.704).
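
The three chi-square-based coefficients in this output can be checked by hand. The Python sketch below (not SYSTAT code) uses the standard textbook definitions of phi, Cramer's V, and the contingency coefficient; the function name `association_measures` is hypothetical, and the assumption that SYSTAT uses these same definitions rests only on the fact that they reproduce the printed values.

```python
import math


def association_measures(table):
    """Return (chi-square, phi, Cramer's V, contingency coefficient)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((obs - r * c / n) ** 2 / (r * c / n)
               for r, row in zip(row_totals, table)
               for c, obs in zip(col_totals, row))
    k = min(len(row_totals), len(col_totals)) - 1
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * k))
    contingency = math.sqrt(chi2 / (chi2 + n))
    return chi2, phi, cramers_v, contingency


# Rows: 30 to 45, 46 to 60, Over 60; columns: Divorced, Married, Never.
age_by_marital = [[21, 48, 9], [12, 39, 8], [5, 23, 3]]
chi2, phi, v, c = association_measures(age_by_marital)
print(round(chi2, 3), round(phi, 3), round(v, 3), round(c, 3))
# 2.173 0.114 0.08 0.113
```

All three coefficients are small, which agrees with the nonsignificant chi-square.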


Example 8 
Two-Way Table Measures (Long Results) 


This example illustrates LONG results and table input. It uses the AGE by CENTER$
table from the cancer study described in the frequency input example.


The input is: 


USE CANCER 
LDISPLAY BOTH 
XTAB 
FREQUENCY NUMBER 
PLENGTH LONG 
LABEL AGE / 50='Under 50', 60='50 to 69',
            70='70 & Over'
TABULATE CENTER$ * AGE / CONFI=0.95


The output is:

Case frequencies determined by value of variable NUMBER

Frequency Distribution for CENTER$ (rows) by AGE (columns)

CENTER$   AGE           | Frequency  Cumulative  Percent  Cumulative
                        |            Frequency            Percent
------------------------+---------------------------------------------
Boston    50) Under 50  |        58          58    7.592       7.592
Boston    60) 50 to 69  |       122         180   15.969      23.560
Boston    70) 70 & Over |        73         253    9.555      33.115
Glamorgn  50) Under 50  |        71         324    9.293      42.408
Glamorgn  60) 50 to 69  |       109         433   14.267      56.675
Glamorgn  70) 70 & Over |        41         474    5.366      62.042
Tokyo     50) Under 50  |       151         625   19.764      81.806
Tokyo     60) 50 to 69  |       120         745   15.707      97.513
Tokyo     70) 70 & Over |        19         764    2.487     100.000

Table of Counts and Percents
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over | Total
----------+-------------------------------------------+---------------
Boston    |   58(7.592%)  122(15.969%)     73(9.555%) | 253(33.115%)
Glamorgn  |   71(9.293%)  109(14.267%)     41(5.366%) | 221(28.927%)
Tokyo     | 151(19.764%)  120(15.707%)     19(2.487%) | 290(37.958%)
----------+-------------------------------------------+---------------
Total     | 280(36.649%)  351(45.942%)   133(17.408%) | 764(100.000%)

Counts
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over   Total
----------+--------------------------------------------------
Boston    |           58           122             73     253
Glamorgn  |           71           109             41     221
Tokyo     |          151           120             19     290
----------+--------------------------------------------------
Total     |          280           351            133     764

Percents of Total Count
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over |   Total        N
----------+-------------------------------------------+------------------
Boston    |        7.592        15.969          9.555 |  33.115  253.000
Glamorgn  |        9.293        14.267          5.366 |  28.927  221.000
Tokyo     |       19.764        15.707          2.487 |  37.958  290.000
----------+-------------------------------------------+------------------
Total     |       36.649        45.942         17.408 | 100.000
N         |      280.000       351.000        133.000 | 764.000

Row Percents
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over |   Total        N
----------+-------------------------------------------+------------------
Boston    |       22.925        48.221         28.854 | 100.000  253.000
Glamorgn  |       32.127        49.321         18.552 | 100.000  221.000
Tokyo     |       52.069        41.379          6.552 | 100.000  290.000
----------+-------------------------------------------+------------------
Total     |       36.649        45.942         17.408 | 100.000
N         |      280.000       351.000        133.000 | 764.000

Column Percents
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over |   Total        N
----------+-------------------------------------------+------------------
Boston    |       20.714        34.758         54.887 |  33.115  253.000
Glamorgn  |       25.357        31.054         30.827 |  28.927  221.000
Tokyo     |       53.929        34.188         14.286 |  37.958  290.000
----------+-------------------------------------------+------------------
Total     |      100.000       100.000        100.000 | 100.000
N         |      280.000       351.000        133.000 | 764.000

Expected Values
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over
----------+------------------------------------------
Boston    |       92.723       116.234         44.043
Glamorgn  |       80.995       101.533         38.473
Tokyo     |      106.283       133.233         50.484

Deviates: (Observed-Expected)
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over
----------+------------------------------------------
Boston    |      -34.723         5.766         28.957
Glamorgn  |       -9.995         7.467          2.527
Tokyo     |       44.717       -13.233        -31.484

Standardized Deviates: (Observed-Expected)/SQR(Expected)
CENTER$ (rows) by AGE (columns)

          | 50) Under 50  60) 50 to 69  70) 70 & Over
----------+------------------------------------------
Boston    |       -3.606         0.535          4.363
Glamorgn  |       -1.111         0.741          0.407
Tokyo     |        4.338        -1.146         -4.431

Chi-square tests of association for CENTER$ and AGE

Test Statistic               |  Value     df  p-value
-----------------------------+------------------------
Pearson Chi-square           | 74.039  4.000    0.000
Likelihood Ratio Chi-square  | 76.963  4.000    0.000
McNemar Symmetry Chi-square  | 79.401  3.000    0.000

Measures of Association for CENTER$ and AGE

                               |                  95% Confidence Interval
Coefficient                    |   Value     ASE     Lower     Upper         Z  p-value
-------------------------------+--------------------------------------------------------
Phi                            |   0.311
Cramer's V                     |   0.220
Contingency                    |   0.297
Goodman-Kruskal's Gamma        |           0.043    -0.502              -9.637    0.000
Kendall's tau-b                |           0.030    -0.333              -9.241    0.000
Stuart's tau-c                 |           0.029    -0.322              -9.063    0.000
Cohen's kappa                  |           0.022    -0.155              -5.234    0.000
Spearman's rho                 |           0.033    -0.370              -9.200    0.000
Somers'd (column dependent)    |           0.030    -0.324              -9.031    0.000
Somers'd (row dependent)       |           0.031    -0.344              -9.021    0.000
Lambda (column dependent)      |           0.053    -0.029               1.420    0.156
Lambda (row dependent)         |           0.036     0.047               3.263    0.001
Lambda (Symmetric)             |           0.045     0.009               2.170    0.030
Uncertainty (column dependent) |           0.011     0.028               4.646    0.000
Uncertainty (row dependent)    |           0.010     0.026               4.599    0.000
Uncertainty (Symmetric)        |           0.010     0.027               4.624    0.000

Number of Valid Cases: 72

The null hypothesis for the Pearson chi-square test is that the table factors are
independent. You reject the hypothesis (chi-square = 74.039, p-value < 0.0005). We are
concerned about the analysis of the full table with four factors in the cancer study
because we see an imbalance between AGE and study CENTER. The researchers in
Tokyo entered a much larger proportion of younger women than did the researchers in
the other cities.


Notice that with PLENGTH LONG, SYSTAT reports all statistics for an r x c table
including those that are appropriate when both factors have ordered categories
(gamma, tau-b, tau-c, and Spearman's rho).
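
The expected values, standardized deviates, and likelihood ratio chi-square in the LONG output all derive from the table of counts. The following Python sketch (not SYSTAT code; the three function names are hypothetical) shows the arithmetic under the usual definitions, which reproduce the printed values for this table.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def standardized_deviates(table):
    """(Observed - Expected) / sqrt(Expected) for every cell."""
    expected = expected_counts(table)
    return [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]


def likelihood_ratio_chi_square(table):
    """2 * sum(O * ln(O / E)); empty cells contribute zero."""
    expected = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)


# Rows: Boston, Glamorgan, Tokyo; columns: Under 50, 50 to 69, 70 & Over.
centers_by_age = [[58, 122, 73], [71, 109, 41], [151, 120, 19]]
boston = [round(z, 3) for z in standardized_deviates(centers_by_age)[0]]
print(boston)  # [-3.606, 0.535, 4.363]
print(round(likelihood_ratio_chi_square(centers_by_age), 3))  # 76.963
```

Squaring and summing the nine standardized deviates gives the Pearson chi-square of 74.039, so the largest deviates (Tokyo and Boston in the youngest and oldest age groups) show which cells drive the association.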


Saving Measures 


The measures computed for two-way tables can be saved to a data file. This is useful
if you want to do resampling.


The input is: 


SAVE TWOB / MEASURES 


PLENGTH NONE / FREQ CHISQ YATES FISHER ODDS,
               YULE PHI CRAMER CONT UNCE LAMBDA LRCHI,
               COCHRAN MCNEM KAPPA RHO GAMMA TAUB TAUC,
               SOMERS
LDISPLAY LABEL
TABULATE TUMOR$ CENTER$*AGE


The output is:

Case Frequencies Determined by Value of Variable NUMBER

Counts
TUMOR$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over   Total
----------+---------------------------------------
MinMalig  |
MinBengn  |
MaxMalig  |
MaxBengn  |

Chi-square Tests of Association for TUMOR$ and AGE

Test Statistic               |  Value     df  p-value
-----------------------------+------------------------
Pearson Chi-square           |  4.615  6.000    0.594
Likelihood Ratio Chi-square  |  4.895  6.000    0.557

Measures of Association for TUMOR$ and AGE

                               |                  95% Confidence Interval
Coefficient                    |   Value     ASE     Lower     Upper         Z  p-value
-------------------------------+--------------------------------------------------------
Phi                            |   0.078
Cramer's V                     |   0.055
Contingency                    |   0.077
Goodman-Kruskal's Gamma        |  -0.047   0.052    -0.148     0.055    -0.905    0.365
Kendall's tau-b                |  -0.029   0.033    -0.093     0.035    -0.899    0.369
Stuart's tau-c                 |  -0.028   0.031    -0.088     0.032    -0.904    0.366
Spearman's rho                 |  -0.033   0.036    -0.103     0.038    -0.901    0.368
Somers'd (column dependent)    |  -0.029   0.032    -0.093     0.034    -0.905    0.366
Somers'd (row dependent)       |  -0.029   0.033    -0.093     0.034    -0.905    0.366
Lambda (column dependent)      |   0.000   0.055    -0.107     0.107     0.000    1.000
Lambda (row dependent)         |   0.000   0.000     0.000     0.000     0.000    0.000
Lambda (Symmetric)             |   0.000   0.027    -0.054     0.054     0.000    1.000
Uncertainty (column dependent) |   0.003   0.003    -0.002     0.008     1.146    0.252
Uncertainty (row dependent)    |   0.003   0.002    -0.002     0.008     1.147    0.252
Uncertainty (Symmetric)        |   0.003   0.003    -0.002     0.008     1.146    0.252

Number of Valid Cases: 72

Counts
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over   Total
----------+---------------------------------------
Tokyo     |      151       120         19     290
Boston    |       58       122         73     253
Glamorgn  |       71       109         41     221
----------+---------------------------------------
Total     |      280       351        133     764

Chi-square Tests of Association for CENTER$ and AGE

Test Statistic               |  Value     df  p-value
-----------------------------+------------------------
Pearson Chi-square           | 74.039  4.000    0.000
Likelihood Ratio Chi-square  | 76.963  4.000    0.000
McNemar Symmetry Chi-square  | 58.761  3.000    0.000

Measures of Association for CENTER$ and AGE

                               |                  95% Confidence Interval
Coefficient                    |   Value     ASE     Lower     Upper         Z  p-value
-------------------------------+--------------------------------------------------------
Phi                            |   0.311
Cramer's V                     |   0.220
Contingency                    |   0.297
Goodman-Kruskal's Gamma        |   0.286   0.046     0.196     0.376     6.250    0.000
Kendall's tau-b                |   0.188   0.028     0.133     0.244     6.644    0.000
Stuart's tau-c                 |   0.182   0.030     0.123     0.240     6.104    0.000
Cohen's kappa                  |   0.105   0.026     0.055     0.156     4.078    0.000
Spearman's rho                 |   0.211   0.034     0.144     0.278     6.201    0.000
Somers'd (column dependent)    |   0.183   0.030     0.124     0.242     6.068    0.000
Somers'd (row dependent)       |   0.194   0.032     0.131     0.257     6.064    0.000
Lambda (column dependent)      |   0.075   0.063    -0.049     0.199     1.191    0.234
Lambda (row dependent)         |   0.118   0.036     0.047     0.189     3.263    0.001
Lambda (Symmetric)             |   0.097   0.050    -0.001     0.194     1.947    0.052
Uncertainty (column dependent) |   0.049   0.011     0.028     0.070     4.646    0.000
Uncertainty (row dependent)    |   0.046   0.010     0.026     0.066     4.599    0.000
Uncertainty (Symmetric)        |   0.047   0.010     0.027     0.068     4.624    0.000

Number of Valid Cases: 72
File has been saved and closed


When the SAVE command is given, the saved file can be viewed automatically from
the view tab of the data editor. Notice also that LDISPLAY BOTH displays both
LABELS and DATA in the output, whereas LDISPLAY LABEL displays only the
LABELS.




Example 9 
Odds Ratios 


For a table with cell counts a, b, c, and d: 


                  Exposure
                  yes    no
Disease    yes     a      b
           no      c      d


where, if you designate the Disease yes people sick and the Disease no people well, the 
odds ratio (or cross-product ratio) equals the odds that a sick person is exposed divided 
by the odds that a well person is exposed, or: 


(a/b)/(c/d) = (ad)/(bc) 


If the odds for the sick and disease-free people are the same, the value of the odds ratio 
is 1.0. 

As an example, use the SURVEY2 file and study the association between gender and
depressive illness. Be careful to order your table factors so that your odds ratio is
constructed correctly (we use LABEL to do this).


The input is: 


USE SURVEY2 

XTAB 
LABEL CASECONT / 1='Depressed', 0='Normal'
PLENGTH NONE/ FREQ ODDS CHISQ 
TABULATE SEX$ * CASECONT/ CONFI=0.95 


The output is:

Counts
SEX$ (rows) by CASECONT (columns)

          |  Normal  Depressed   Total
----------+---------------------------
Male      |      96          8     104
Female    |     116         36     152
----------+---------------------------
Total     |     212         44     256

Chi-square Tests of Association for SEX$ and CASECONT

Test Statistic      |  Value     df  p-value
--------------------+------------------------
Pearson Chi-square  | 11.095  1.000    0.001

Measures of Association for SEX$ and CASECONT

            |                 95% Confidence Interval
Coefficient |  Value    ASE      Lower      Upper        Z  p-value
------------+-------------------------------------------------------
Odds Ratio  |  3.724
Ln(Odds)    |  1.315  0.415      0.502      2.127    3.172    0.002

Number of Valid Cases: 256


The odds that a female is depressed are 36 to 116, the odds for a male are 8 to 96, and 
the odds ratio is 3.724. Thus, in this sample, females are almost four times more likely 
to be depressed than males. But, does our sample estimate differ significantly from 
1.0? Because the distribution of the odds ratio is very skewed, significance is 
determined by examining Ln(Odds), the natural logarithm of the ratio, and the standard 
error of the transformed ratio. Note the symmetry when the ratios are transformed: 


3       Ln3
2       Ln2
1       Ln1 = 0
1/2    -Ln2
1/3    -Ln3


The value of Ln(Odds) here is 1.315 with a standard error of 0.415. Constructing an 
approximate 95% confidence interval using the statistic plus or minus two times its 
standard error: 


1.315 ± 2 * 0.415 = 1.315 ± 0.830


results in:

0.485 < Ln(Odds) < 2.145


Because 0 is not included in the interval, Ln(Odds) differs significantly from 0, and the 
odds ratio differs from 1.0. 


Using the calculator to take antilogs of the limits. You can use SYSTAT’s calculator to
take antilogs of the limits EXP(0.485) and EXP(2.145) and obtain a confidence interval
for the odds ratio:

e^(0.485) < odds ratio < e^(2.145)

1.624 < odds ratio < 8.542

That is, for the lower limit, type CALC EXP(0.485).


Notice that the proportion of females who are depressed is 0.2368 (from a table of row 
percentages not displayed here) and the proportion of males is 0.0769, so you also 
reject the hypothesis of equality of proportions (chi-square = 11.095, p-value = 0.001). 
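
The odds ratio, its logarithm, and the standard error can be reproduced from the four cell counts. The Python sketch below is not SYSTAT code. The function name `log_odds_ratio` is hypothetical, and the standard error uses the common large-sample formula sqrt(1/a + 1/b + 1/c + 1/d), which is an assumption about the method; it does reproduce the ASE of 0.415 shown in the output.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Return (odds ratio, ln odds ratio, standard error of ln odds ratio)
    for the 2 x 2 table with cells a, b (first row) and c, d (second row)."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, math.log(odds_ratio), se


# a: depressed females, b: normal females, c: depressed males, d: normal males
or_, ln_or, se = log_odds_ratio(36, 116, 8, 96)
print(round(or_, 3), round(ln_or, 3), round(se, 3))  # 3.724 1.315 0.415

# Approximate 95% limits: statistic plus or minus two standard errors,
# then antilogs to return to the odds ratio scale.
lower, upper = ln_or - 2 * se, ln_or + 2 * se
print(math.exp(lower), math.exp(upper))
```

Because the sketch carries full precision instead of the rounded values 1.315 and 0.415, its limits differ slightly in the later decimals from 1.624 and 8.542. The interval in the SYSTAT output (0.502 to 2.127) corresponds to a multiplier of 1.96 in place of 2.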


Example 10 
Fisher’s Exact Test 


Let us say that you are interested in how salaries of female executives compare with 
those of male executives at a particular firm. The accountant there will not give you 
salaries in dollar figures but does tell you whether the executives’ salaries are low or
high:

           Low   High
Male         2      7
Female       5      1


The sample size is very small. With PLENGTH NONE, you request three tests:
Fisher's exact test, the Pearson chi-square test, and Yates' corrected chi-square.


The input is: 


USE SALARY 

LDISPLAY LABEL 

XTAB 
FREQUENCY COUNT 
LABEL SEX / 1='male', 2='female' 
LABEL EARNINGS / 1='low', 2='high' 
PLENGTH NONE / FISHER CHISQ YATES FREQ 
TABULATE SEX * EARNINGS 


The output is: 
Counts
SEX (rows) by EARNINGS (columns)

         |  low   high | Total
---------+-------------+------
male     |    2      7 |     9
female   |    5      1 |     6
---------+-------------+------
Total    |    7      8 |    15


*** WARNING *** : More than One-fifth of the fitted Cells are sparse (Frequency < 5).
Significance Tests computed on this table are Suspect.

Chi-square tests of association for SEX and EARNINGS

Test Statistic               |  Value     df   p-value
-----------------------------+------------------------
Pearson Chi-square           |  5.402  1.000     0.020
Fisher Exact Test (two-tail) |                   0.041

Number of Valid Cases: 4


Notice that SYSTAT warns you that the results are suspect because the counts in the 
table are too low (sparse). Technically, the message states that more than one-fifth of 
the cells have expected values (fitted values) of less than 5. 

The p-value for the Pearson chi-square (0.020) leads you to believe that SEX and 
EARNINGS are not independent. But there is a warning about suspect results. This 
warning applies to the Pearson chi-square test but not to Fisher’s exact test. Fisher’s
test counts all possible outcomes exactly, including the ones that produce an
interaction greater than what you observe. The Fisher exact test p-value is also
significant. On this basis, you reject the null hypothesis of independence (no 
interaction between SEX and EARNINGS). 
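
To see what "counting all possible outcomes exactly" means, here is a minimal Python sketch of a two-tail Fisher exact test. It is an illustration, not SYSTAT's implementation: the function name is hypothetical, and the two-tail rule used (sum the probabilities of every table that is no more likely than the observed one, with the margins held fixed) is an assumption, although it reproduces the value 0.041 reported above.

```python
from math import comb

def fisher_exact_two_tail(a, b, c, d):
    """Two-tail Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Add up every table that is no more likely than the one observed
    return sum(prob(x) for x in range(smallest, largest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Rows: male, female; columns: low, high
p_value = fisher_exact_two_tail(2, 7, 5, 1)
print(f"Fisher exact test (two-tail): {p_value:.3f}")
```

With only 15 executives there are just seven possible tables with these margins, so the enumeration is tiny.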


Sensitivity 


Results for small samples, however, can be fairly sensitive. One case can matter. What 
if the accountant forgets one well-paid male executive? 


Case Frequencies Determined by Value of Variable COUNT

Counts
SEX (rows) by EARNINGS (columns)

         |  low   high | Total
---------+-------------+------
male     |    2      6 |     8
female   |    5      1 |     6
---------+-------------+------
Total    |    7      7 |    14


*** WARNING *** : More than One-fifth of the fitted Cells are sparse (Frequency < 5).
Significance Tests computed on this table are Suspect.

Chi-square tests of association for SEX and EARNINGS

Test Statistic               |  Value     df   p-value
-----------------------------+------------------------
Pearson Chi-square           |  4.667  1.000     0.031
Fisher Exact Test (two-tail) |                   0.103

Number of Valid Cases: 4

The results of the Fisher exact test indicate that you cannot reject the null hypothesis
of independence. It is too bad that you do not have the actual salaries. Much 
information is lost when a quantitative variable like salary is dichotomized into LOW 
and HIGH. 
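
The effect of dropping one case can be checked by recomputing both tests for the two tables. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the chi-square p-value uses the 1 df identity P(chi-square > x) = erfc(sqrt(x / 2)), and the Fisher two-tail rule sums the tables that are no more likely than the observed one. For the 14-case table it gives a Fisher p-value of about 0.103, which is why independence can no longer be rejected.

```python
from math import comb, erfc, sqrt

def pearson_chi_square(a, b, c, d):
    """Pearson chi-square and its p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, erfc(sqrt(chi2 / 2))

def fisher_two_tail(a, b, c, d):
    """Two-tail Fisher exact p-value with the table margins held fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    probs = [comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(max(0, col1 - row2), min(row1, col1) + 1)]
    observed = comb(row1, a) * comb(row2, col1 - a) / total
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

results = {}
for label, cells in (("all 15 executives", (2, 7, 5, 1)),
                     ("one male omitted", (2, 6, 5, 1))):
    chi2, p_chi2 = pearson_chi_square(*cells)
    results[label] = (chi2, p_chi2, fisher_two_tail(*cells))
    print(f"{label}: chi-square = {chi2:.3f} (p = {p_chi2:.3f}), "
          f"Fisher p = {results[label][2]:.3f}")
```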


What Is a Small Expected Value? 


In larger contingency tables, you do not want to see any expected values less than 1.0 
or more than 20% of the values less than 5. For large tables with too many small 
expected values, there is no remedy but to combine categories or possibly omit a 
category that has very few observations. 
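
The rule of thumb above is easy to check by hand or in code. The following minimal Python sketch (the function names are illustrative, not SYSTAT commands) computes the expected value of each cell as row total times column total divided by the table total, then reports the smallest expected value and the share of cells below 5, using the salary table from Example 10.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def sparse_summary(table):
    """Smallest expected count and the fraction of cells expected below 5."""
    cells = [e for row in expected_counts(table) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return min(cells), share_below_5

salary = [[2, 7],   # male:   low, high
          [5, 1]]   # female: low, high

smallest, share = sparse_summary(salary)
print(expected_counts(salary))
print(f"smallest expected value = {smallest:.1f}, share below 5 = {share:.0%}")
```

Every expected value in the salary table is below 5, which is why SYSTAT flags the chi-square test as suspect.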


Example 11 
Cochran’s Test of Linear Trend 


When one table factor is dichotomous and the other has three or more ordered
categories (for example, low, medium, and high), Cochran’s test of linear trend is used
to test the null hypothesis that the slope of a regression line across the proportions is 0.
For example, in studying the relation of depression to education, you form this table
for the SURVEY2 data and plot the proportion depressed:


EDUCATN          1      2      3      4
Depressed       14     18     11      1
Normal          36     80     75     21
Proportion    0.28   0.18   0.13   0.05

[Plot: proportion depressed (vertical axis) against EDUCATN scores 1 to 4 (horizontal axis)]


If you regress the proportions on scores 1, 2, 3, and 4 assigned by SYSTAT to the
ordered categories, you can test whether the slope is significant.
This is what we do in this example. We also explore the relation of depression to health.


The input is: 


USE SURVEY2 

LDISPLAY LABEL 

XTAB 
LABEL CASECONT / 1='Depressed', 0='Normal' 
RECODE EDUCATN1$=EDUCATN / 1,2='Dropout', 3='HS grad', 

4,5='College', 6,7='Degree +' 
RECODE HEALTHY1$=HEALTHY / 1='Excellent', 2='Good', 
3,4='Fair/Poor' 

PLENGTH NONE / FREQ COLPCT COCHRAN CHISQ 
TABULATE CASECONT * EDUCATN1$ 
TABULATE CASECONT * HEALTHY1$ 


The output is: 
Counts
CASECONT (rows) by EDUCATN1$ (columns)

           |  College   HS grad   Dropout  Degree + |   Total
-----------+----------------------------------------+--------
Normal     |       75        80        36        21 |     212
Depressed  |       11        18        14         1 |      44
-----------+----------------------------------------+--------
Total      |       86        98        50        22 |     256


Column Percents
CASECONT (rows) by EDUCATN1$ (columns)

           |  College   HS grad   Dropout  Degree + |   Total        N
-----------+----------------------------------------+-----------------
Normal     |   87.209    81.633    72.000    95.455 |  82.813  212.000
Depressed  |   12.791    18.367    28.000     4.545 |  17.188   44.000
-----------+----------------------------------------+-----------------
Total      |  100.000   100.000   100.000   100.000 | 100.000
N          |   86.000    98.000    50.000    22.000 |          256.000


Chi-square Tests of Association for CASECONT and EDUCATN1$

Test Statistic          |  Value     df   p-value
------------------------+------------------------
Pearson Chi-square      |  7.841  3.000     0.049
Cochran's Linear Trend  |  0.413  1.000     0.521


Number of Valid Cases: 256
Counts
CASECONT (rows) by HEALTHY1$ (columns)

           | Excellent      Good  Fair/Poor |   Total
-----------+--------------------------------+--------
Normal     |       105        78         29 |     212
Depressed  |        16        15         13 |      44
-----------+--------------------------------+--------
Total      |       121        93         42 |     256


Column Percents
CASECONT (rows) by HEALTHY1$ (columns)

           | Excellent      Good  Fair/Poor |   Total        N
-----------+--------------------------------+-----------------
Normal     |    86.777    83.871     69.048 |  82.813  212.000
Depressed  |    13.223    16.129     30.952 |  17.188   44.000
-----------+--------------------------------+-----------------
Total      |   100.000   100.000    100.000 | 100.000
N          |   121.000    93.000     42.000 |          256.000


Chi-square Tests of Association for CASECONT and HEALTHY1$

Test Statistic          |  Value     df   p-value
------------------------+------------------------
Pearson Chi-square      |  7.000  2.000     0.030
Cochran's Linear Trend  |  5.671  1.000     0.017


Number of Valid Cases: 256


As the level of education increases, the proportion of depressed subjects decreases 
(Cochran's Linear Trend = 0.413, df= 1, and Prob (p-value) = 0.521). Of those not 
graduating from high school (Dropout), 28% are depressed, and 4.55% of those with 
advanced degrees are depressed. Notice that the Pearson chi-square is marginally 
significant (p-value = 0.049). It simply tests the hypothesis that the four proportions 
are equal rather than decreasing linearly. 

In contrast to education, the proportion of depressed subjects tends to increase 
linearly as health deteriorates (p-value = 0.017). 
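
The trend statistic can be sketched in a few lines of Python. This is an illustration under assumptions, not SYSTAT's code: the function name is hypothetical, categories are scored 1, 2, 3, ... in the order supplied, and the statistic is the squared score-weighted deviation of the depressed counts divided by its variance under the null hypothesis. With the columns in the order printed in the output above, this formula reproduces 0.413 for education and 5.671 for health.

```python
from math import erfc, sqrt

def cochran_linear_trend(events, totals):
    """Chi-square (1 df) and p-value for a linear trend in proportions.

    events[i] and totals[i] are the depressed count and column total for
    the category given score i + 1.
    """
    scores = range(1, len(totals) + 1)
    n = sum(totals)
    p = sum(events) / n                      # overall proportion depressed
    t = sum(x * (d - m * p) for x, d, m in zip(scores, events, totals))
    sxx = (sum(m * x * x for x, m in zip(scores, totals))
           - sum(m * x for x, m in zip(scores, totals)) ** 2 / n)
    chi2 = t * t / (p * (1 - p) * sxx)
    return chi2, erfc(sqrt(chi2 / 2))

# Column order as printed: College, HS grad, Dropout, Degree +
edu_printed = cochran_linear_trend([11, 18, 14, 1], [86, 98, 50, 22])
# Column order as printed: Excellent, Good, Fair/Poor
health = cochran_linear_trend([16, 15, 13], [121, 93, 42])
# Same education counts rearranged: Dropout, HS grad, College, Degree +
edu_by_level = cochran_linear_trend([14, 18, 11, 1], [50, 98, 86, 22])

for name, (chi2, p_value) in (("education, printed order", edu_printed),
                              ("health", health),
                              ("education, ordered by level", edu_by_level)):
    print(f"{name}: trend chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

The third call shows how strongly the result depends on the scores: when the education columns are arranged from Dropout to Degree +, the same counts give a much larger trend statistic, so it is worth confirming that the column order in the output matches the intended ordering before interpreting the test.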


Example 12 
Tables with Ordered Categories 


In this example, we focus on statistics for studies in which both table factors have a few 
ordered categories. For example, a teacher evaluating the activity level of 
schoolchildren may feel that she cannot score them from 1 to 20 but that she could 
categorize the activity of each child as sedentary, normal, or hyperactive. Here you 
study the relation of health status to age. If the category codes are character-valued, you 
must indicate the correct ordering (as opposed to the default alphabetical ordering). 

For Spearman's rho, instead of using actual data values, the indices of the categories
are used to compute the usual correlation. Gamma measures the probability of getting
like (as opposed to unlike) orders of values. Its numerator is identical to that of
Kendall’s tau-b and Stuart's tau-c.


The input is:


USE SURVEY2 

LDISPLAY LABEL 

XTAB 

RECODE HEALTHY1$=HEALTHY / 1='Excellent', 2='Good',
3,4='Fair/Poor'

RECODE AGE1$=AGE / .. 29='18 to 29', 30 .. 45='30 to 45',
46 .. 60='46 to 60',
60 ..='Over 60'

PLENGTH NONE / FREQ ROWP GAMMA RHO CHISQ

TABULATE HEALTHY1$ * AGE1$


The output is: 
Counts
HEALTHY1$ (rows) by AGE1$ (columns)

           | 18 to 29  30 to 45  46 to 60   Over 60 |   Total
-----------+----------------------------------------+--------
Excellent  |       43        48        25         5 |     121
Fair/Poor  |        6         9        15        12 |      42
Good       |       30        23        24        16 |      93
-----------+----------------------------------------+--------
Total      |       79        80        64        33 |     256


Row Percents
HEALTHY1$ (rows) by AGE1$ (columns)

           | 18 to 29  30 to 45  46 to 60   Over 60 |   Total        N
-----------+----------------------------------------+-----------------
Excellent  |   35.537    39.669    20.661     4.132 | 100.000  121.000
Fair/Poor  |   14.286    21.429    35.714    28.571 | 100.000   42.000
Good       |   32.258    24.731    25.806    17.204 | 100.000   93.000
-----------+----------------------------------------+-----------------
Total      |   30.859    31.250    25.000    12.891 | 100.000
N          |   79.000    80.000    64.000    33.000 |          256.000


Chi-square Tests of Association for HEALTHY1$ and AGE1$

Test Statistic      |   Value     df   p-value
--------------------+-------------------------
Pearson Chi-square  |  29.380  6.000     0.000


Measures of Association for HEALTHY1$ and AGE1$

Coefficient              |  Value    ASE   95% Confidence Interval       Z   p-value
                         |                    Lower     Upper
-------------------------+----------------------------------------------------------
Goodman-Kruskal's Gamma  |  0.212  0.077      0.061     0.363        2.746     0.006
Spearman's rho           |  0.163  0.061      0.044     0.282        2.684     0.007


Number of Valid Cases: 256


Not surprisingly, as age increases, health status tends to deteriorate. In the table of row
percentages, notice that among those with EXCELLENT health, 4.13% are in the oldest
age group; in the GOOD category, 17.2% are in the oldest group; and in the
FAIR/POOR category, 28.57% are in the oldest group.

The value of gamma is 0.212; rho is 0.163. Here are confidence intervals (Value ± 2 *
Asymptotic Std Error) for each statistic:


0.061 <= 0.212 <= 0.363
0.044 <= 0.163 <= 0.282


Because 0 is in neither interval, you conclude that there is an association between 
health and age. 
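
Gamma is built from counts of concordant and discordant pairs, so it depends on the order in which the categories are listed. The following is a minimal Python sketch (the function name is illustrative; it is not a SYSTAT routine) that computes gamma = (C - D) / (C + D) directly from the cell counts. It is applied twice: once with the rows in the order shown in the output (Excellent, Fair/Poor, Good), which reproduces 0.212, and once with the rows arranged from best to worst health.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D); rows and columns are taken as ordered."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # every later row
                for m in range(n_cols):
                    if m > j:                    # later row, later column
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                  # later row, earlier column
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Counts by age group: 18 to 29, 30 to 45, 46 to 60, Over 60
excellent = [43, 48, 25, 5]
good = [30, 23, 24, 16]
fair_poor = [6, 9, 15, 12]

as_displayed = goodman_kruskal_gamma([excellent, fair_poor, good])
by_health = goodman_kruskal_gamma([excellent, good, fair_poor])

print(f"rows as displayed (alphabetical): gamma = {as_displayed:.3f}")
print(f"rows ordered by health:           gamma = {by_health:.3f}")
```

The two values differ noticeably, which illustrates why character-valued categories need an explicit ordering rather than the default alphabetical one.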


Example 13 
McNemar’s Test of Symmetry 


In November 1993, the U.S. Congress approved the North American Free Trade 
Agreement (NAFTA). Let us say that two months before the approval and before the 
televised debate between Vice President Al Gore and businessman Ross Perot, political 
pollsters queried a sample of 350 people, asking “Are you for, unsure, or against 
NAFTA?” Immediately after the debate, the pollsters contacted the same people and 
asked the question a second time. Here are the responses: 


                          After
                   For   Unsure   Against
         For        51       22        28
Before   Unsure     46       18        27
         Against    52       49        57


The pollsters wonder, “Is there a shift in opinion about NAFTA?” The study design for
the answer is similar to a paired t test: each subject has two responses. The row and
column categories of our table are the same variable measured at different points in
time.


The file NAFTA contains these data. To test for an opinion shift, the input is: 


USE NAFTA 

XTAB 
FREQUENCY COUNT
ORDER BEFORE$ AFTER$ / SORT='for','unsure','against'
PLENGTH NONE / FREQ MCNEMAR CHISQ PERCENT
TABULATE BEFORE$ * AFTER$


We use ORDER to ensure that the row and column categories are ordered the same.

The output is:
Case frequencies determined by value of variable COUNT

Counts
BEFORE$ (rows) by AFTER$ (columns)

         |     for   unsure  against |  Total
---------+---------------------------+-------
for      |      51       22       28 |    101
unsure   |      46       18       27 |     91
against  |      52       49       57 |    158
---------+---------------------------+-------
Total    |     149       89      112 |    350


Percents of Total Count
BEFORE$ (rows) by AFTER$ (columns)

         |     for   unsure  against |   Total        N
---------+---------------------------+-----------------
for      |  14.571    6.286    8.000 |  28.857  101.000
unsure   |  13.143    5.143    7.714 |  26.000   91.000
against  |  14.857   14.000   16.286 |  45.143  158.000
---------+---------------------------+-----------------
Total    |  42.571   25.429   32.000 | 100.000
N        | 149.000   89.000  112.000 |          350.000


Chi-square tests of association for BEFORE$ and AFTER$

Test Statistic               |   Value     df   p-value
-----------------------------+-------------------------
Pearson Chi-square           |  11.473  4.000     0.022
McNemar Symmetry Chi-square  |  22.039  3.000     0.000


Number of Valid Cases: 9


The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those
along the diagonal are not used in the computations). We are investigating the direction
of change in opinion. First, how many respondents became more negative about
NAFTA?


■ Among those who initially responded For, 22 (6.29%) are now Unsure and 28
  (8%) are now Against.

■ Among those who were Unsure before the debate, 27 (7.71%) answered Against
  afterwards.


The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in
the lower left contain counts for people who became more positive about NAFTA
(46, 52, and 49) or 42% of the sample.

The null hypothesis for the McNemar test is that the changes in opinion are equal.
The chi-square statistic for this test is 22.039 with 3 df and p-value < 0.0005. You reject
the null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the
anti-NAFTA shift.

You also clearly reject the null hypothesis that the row (BEFORE$) and column
(AFTER$) factors are independent (chi-square = 11.473; p-value = 0.022). However, a
test of independence does not answer your original question about change of opinion 
and its direction. 
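
The symmetry statistic uses only the off-diagonal pairs, which makes it simple to compute by hand. Here is a minimal Python sketch, assuming the usual generalization of McNemar's test to square tables (often called Bowker's test): each pair of mirrored cells contributes (n_ij - n_ji)^2 / (n_ij + n_ji). The function name is illustrative. Applied to the NAFTA counts, it reproduces the 22.039 with 3 df shown in the output.

```python
def symmetry_chi_square(table):
    """McNemar (Bowker) symmetry chi-square and df for a square table."""
    size = len(table)
    chi2 = 0.0
    df = 0
    for i in range(size):
        for j in range(i + 1, size):
            pair_total = table[i][j] + table[j][i]
            if pair_total > 0:       # pairs with no cases contribute nothing
                chi2 += (table[i][j] - table[j][i]) ** 2 / pair_total
                df += 1
    return chi2, df

# Rows: before (for, unsure, against); columns: after (for, unsure, against)
nafta = [[51, 22, 28],
         [46, 18, 27],
         [52, 49, 57]]

chi2, df = symmetry_chi_square(nafta)
n = sum(map(sum, nafta))
more_negative = sum(nafta[i][j] for i in range(3) for j in range(i + 1, 3))
more_positive = sum(nafta[i][j] for i in range(3) for j in range(i))

print(f"symmetry chi-square = {chi2:.3f} with {df} df")
print(f"became more negative: {more_negative} ({100 * more_negative / n:.0f}%)")
print(f"became more positive: {more_positive} ({100 * more_positive / n:.0f}%)")
```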


Example 14 
Multiway Tables 


When you have three or more table factors, XTAB forms a series of two-way tables
stratified by all combinations of values of the third, fourth, and so on, table factors. The
order in which you choose the table factors determines the layout. Your input can be
the usual cases-by-variables data file or the cell counts with category values.


The input is: 


USE CANCER

XTAB

FREQUENCY NUMBER

LABEL AGE / 50='Under 50', 60='50 to 69', 70='70 & Over'

ORDER CENTER$ / SORT=NONE

ORDER TUMOR$ / SORT='MinBengn', 'MaxBengn',
'MinMalig', 'MaxMalig'

TABULATE SURVIVE$ * TUMOR$ * CENTER$ * AGE
PLENGTH NONE / FREQ

STD SURVIVE$ * TUMOR$ * CENTER$ * AGE

REM Frequency of total association

PLENGTH NONE / FREQ

TABULATE CENTER$ * AGE


The last two factors selected (CENTER$ and AGE) define two-way tables. The levels
of the first two factors define the strata. After the table is run, we edited the output and
moved the four tables for SURVIVE$ = Dead next to those for Alive. The new
command STD is included to statistically remove the effect of strata variables so that
the relationship between CENTER$ and AGE can be examined without SURVIVE$ and
TUMOR$. We can compare the original total association of CENTER$ and AGE with the
same association after the effects of the test factors (that is, SURVIVE$ and TUMOR$)
have been statistically removed.

The output is:
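Case Frequencies Determined by Value of Variable NUMBER

Counts

SURVIVE$ = Alive
TUMOR$   = MinBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |       68        46          6 |    120
Boston    |       24        58         26 |    108
Glamorgn  |       20        39         11 |     70
----------+-------------------------------+-------
Total     |      112       143         43 |    298

SURVIVE$ = Alive
TUMOR$   = MaxBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |        9         5          1 |     15
Boston    |        0         3          1 |      4
Glamorgn  |        1         4          1 |      6
----------+-------------------------------+-------
Total     |       10        12          3 |     25

SURVIVE$ = Alive
TUMOR$   = MinMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |       26        20          1 |     47
Boston    |       11        18         15 |     44
Glamorgn  |       16        27         12 |     55
----------+-------------------------------+-------
Total     |       53        65         28 |    146

SURVIVE$ = Alive
TUMOR$   = MaxMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |       25        18          5 |     48
Boston    |        4        10          1 |     15
Glamorgn  |        8        10          4 |     22
----------+-------------------------------+-------
Total     |       37        38         10 |     85

SURVIVE$ = Dead
TUMOR$   = MinBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |        7         9          3 |     19
Boston    |        7        20         18 |     45
Glamorgn  |        7        12          7 |     26
----------+-------------------------------+-------
Total     |       21        41         28 |     90

SURVIVE$ = Dead
TUMOR$   = MaxBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |        3         2          0 |      5
Boston    |        0         2          0 |      2
Glamorgn  |        0         0          0 |      0
----------+-------------------------------+-------
Total     |        3         4          0 |      7

SURVIVE$ = Dead
TUMOR$   = MinMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |        9         9          2 |     20
Boston    |        6         8          9 |     23
Glamorgn  |       16        14          3 |     33
----------+-------------------------------+-------
Total     |       31        31         14 |     76

SURVIVE$ = Dead
TUMOR$   = MaxMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |        4        11          1 |     16
Boston    |        6         3          3 |     12
Glamorgn  |        3         3          3 |      9
----------+-------------------------------+-------
Total     |       13        17          7 |     37
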
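Row Percents

SURVIVE$ = Alive
TUMOR$   = MinBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   56.667    38.333      5.000 | 100.000  120.000
Boston    |   22.222    53.704     24.074 | 100.000  108.000
Glamorgn  |   28.571    55.714     15.714 | 100.000   70.000
----------+-------------------------------+-----------------
Total     |   37.584    47.987     14.430 | 100.000
N         |  112.000   143.000     43.000 |          298.000

SURVIVE$ = Alive
TUMOR$   = MaxBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   60.000    33.333      6.667 | 100.000   15.000
Boston    |    0.000    75.000     25.000 | 100.000    4.000
Glamorgn  |   16.667    66.667     16.667 | 100.000    6.000
----------+-------------------------------+-----------------
Total     |   40.000    48.000     12.000 | 100.000
N         |   10.000    12.000      3.000 |           25.000

SURVIVE$ = Alive
TUMOR$   = MinMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   55.319    42.553      2.128 | 100.000   47.000
Boston    |   25.000    40.909     34.091 | 100.000   44.000
Glamorgn  |   29.091    49.091     21.818 | 100.000   55.000
----------+-------------------------------+-----------------
Total     |   36.301    44.521     19.178 | 100.000
N         |   53.000    65.000     28.000 |          146.000

SURVIVE$ = Alive
TUMOR$   = MaxMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   52.083    37.500     10.417 | 100.000   48.000
Boston    |   26.667    66.667      6.667 | 100.000   15.000
Glamorgn  |   36.364    45.455     18.182 | 100.000   22.000
----------+-------------------------------+-----------------
Total     |   43.529    44.706     11.765 | 100.000
N         |   37.000    38.000     10.000 |           85.000

SURVIVE$ = Dead
TUMOR$   = MinBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   36.842    47.368     15.789 | 100.000   19.000
Boston    |   15.556    44.444     40.000 | 100.000   45.000
Glamorgn  |   26.923    46.154     26.923 | 100.000   26.000
----------+-------------------------------+-----------------
Total     |   23.333    45.556     31.111 | 100.000
N         |   21.000    41.000     28.000 |           90.000

SURVIVE$ = Dead
TUMOR$   = MaxBengn
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   60.000    40.000      0.000 | 100.000    5.000
Boston    |    0.000   100.000      0.000 | 100.000    2.000
Glamorgn  |    0.000     0.000      0.000 | 100.000    0.000
----------+-------------------------------+-----------------
Total     |   42.857    57.143      0.000 | 100.000
N         |    3.000     4.000      0.000 |            7.000

SURVIVE$ = Dead
TUMOR$   = MinMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   45.000    45.000     10.000 | 100.000   20.000
Boston    |   26.087    34.783     39.130 | 100.000   23.000
Glamorgn  |   48.485    42.424      9.091 | 100.000   33.000
----------+-------------------------------+-----------------
Total     |   40.789    40.789     18.421 | 100.000
N         |   31.000    31.000     14.000 |           76.000

SURVIVE$ = Dead
TUMOR$   = MaxMalig
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |   Total        N
----------+-------------------------------+-----------------
Tokyo     |   25.000    68.750      6.250 | 100.000   16.000
Boston    |   50.000    25.000     25.000 | 100.000   12.000
Glamorgn  |   33.333    33.333     33.333 | 100.000    9.000
----------+-------------------------------+-----------------
Total     |   35.135    45.946     18.919 | 100.000
N         |   13.000    17.000      7.000 |           37.000
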
Partial measures 


We can compute partial measures of association on the standardized table generated
here, just as measures are computed for two-way tables.


The input is: 


FREQUENCY NUMBER

LABEL AGE / 50='Under 50', 60='50 to 69', 70='70 & Over'
ORDER CENTER$ / SORT=NONE

ORDER TUMOR$ / SORT='MinBengn', 'MaxBengn', 'MinMalig',
'MaxMalig'

PLENGTH NONE / FREQ ROWPCT YATES FISHER ODDS YULE PHI CRAMER,
CONT UNCE LAMBDA LRCHI COCHRAN MCNEM KAPPA RHO GAMMA,
TAUB TAUC SOMERS

STD SURVIVE$ * TUMOR$ * CENTER$ * AGE

REM MEASURES OF TOTAL ASSOCIATION

PLENGTH NONE / FREQ YATES FISHER ODDS YULE PHI CRAMER CONT,
UNCE LAMBDA LRCHI COCHRAN MCNEM KAPPA RHO GAMMA TAUB TAUC,
SOMERS

TABULATE CENTER$ * AGE

PLENGTH


The output is: 
Case frequencies determined by value of variable NUMBER

Standardized Counts After Removing The Effect of Test Factor(s) SURVIVE$
*CENTER$ (rows) by *AGE (columns)

          | Under 50  50 to 69  70 & Over |    Total
----------+-------------------------------+---------
Tokyo     |  147.927   122.088     19.985 |  290.000
Boston    |   58.851   125.131     69.018 |  253.000
Glamorgn  |   68.929   110.681     41.391 |  221.000
----------+-------------------------------+---------
Total     |  275.706   357.899    130.394 |  764.000


Test Statistic               |   Value     df   p-value
-----------------------------+-------------------------
Likelihood Ratio Chi-square  |  69.058  4.000     0.000
McNemar Symmetry Chi-square  |  58.702  3.000     0.000


Partial Measures of Association for *CENTER$ and *AGE

Coefficient                     |  Value    ASE   95% Confidence Interval       Z   p-value
                                |                    Lower     Upper
--------------------------------+----------------------------------------------------------
Phi                             |
Cramer's V                      |
Contingency                     |
Goodman-Kruskal's Gamma         |         0.046      0.194     0.375
Kendall's tau-b                 |         0.029      0.131     0.243
Stuart's tau-c                  |         0.030      0.121     0.238
Cohen's kappa                   |         0.026      0.056     0.157
Spearman's rho                  |  0.209  0.034      0.142     0.276
Somers' d (column dependent)    |  0.181  0.030      0.122     0.240
Somers' d (row dependent)       |
Lambda (column dependent)       |                   -0.064     0.191        0.981     0.327
Lambda (row dependent)          |                    0.038     0.181        3.010     0.003
Lambda (Symmetric)              |                   -0.013     0.186        1.711     0.087
Uncertainty (column dependent)  |                    0.024     0.064        4.377     0.000
Uncertainty (row dependent)     |                    0.023     0.060        4.337     0.000
Uncertainty (Symmetric)         |                    0.023     0.062        4.358     0.000


Case frequencies determined by value of variable NUMBER

Counts
CENTER$ (rows) by AGE (columns)

          | Under 50  50 to 69  70 & Over |  Total
----------+-------------------------------+-------
Tokyo     |      151       120         19 |    290
Boston    |       58       122         73 |    253
Glamorgn  |       71       109         41 |    221
----------+-------------------------------+-------
Total     |      280       351        133 |    764


Chi-square Tests of Association for CENTER$ and AGE

Test Statistic               |   Value     df   p-value
-----------------------------+-------------------------
Likelihood Ratio Chi-square  |  76.963  4.000     0.000
McNemar Symmetry Chi-square  |  58.761  3.000     0.000


Measures of Association for CENTER$ and AGE

[Table of the same sixteen coefficients (Phi through Uncertainty); values not recoverable.]


The null hypothesis for the chi-square test is that the table factors are
independent. You reject the hypothesis (likelihood ratio chi-square = 76.963, p-value < 0.0005).


Notice that the partial association measures report a high degree of association between
CENTER$ and AGE as compared to the total association, where there is a sort of imbalance
between CENTER$ and AGE.


List Layout

To create a listing of the counts in each cell of the table, use the LIST option.


The input is:

PLENGTH NONE / LIST

TABULATE SURVIVE$ * CENTER$ * AGE * TUMOR$

The output is:


Case Frequencies Determined by Value of Variable NUMBER
Frequency Distribution Table

SURVIVE$  CENTER$   AGE        TUMOR$    | Frequency  Cumulative  Percent  Cumulative
                                         |             Frequency              Percent
-----------------------------------------+-------------------------------------------
Alive     Tokyo     Under 50   MinBengn  |        68          68    8.901       8.901
Alive     Tokyo     Under 50   MaxBengn  |         9          77    1.178      10.079
Alive     Tokyo     Under 50   MinMalig  |        26         103    3.403      13.482
Alive     Tokyo     Under 50   MaxMalig  |        25         128    3.272      16.754
Alive     Tokyo     50 to 69   MinBengn  |        46         174    6.021      22.775
Alive     Tokyo     50 to 69   MaxBengn  |         5         179    0.654      23.429
Alive     Tokyo     50 to 69   MinMalig  |        20         199    2.618      26.047
Alive     Tokyo     50 to 69   MaxMalig  |        18         217    2.356      28.403
Alive     Tokyo     70 & Over  MinBengn  |         6         223    0.785      29.188
Alive     Tokyo     70 & Over  MaxBengn  |         1         224    0.131      29.319
Alive     Tokyo     70 & Over  MinMalig  |         1         225    0.131      29.450
Alive     Tokyo     70 & Over  MaxMalig  |         5         230    0.654      30.105
Alive     Boston    Under 50   MinBengn  |        24         254    3.141      33.246
Alive     Boston    Under 50   MinMalig  |        11         265    1.440      34.686
Alive     Boston    Under 50   MaxMalig  |         4         269    0.524      35.209
Alive     Boston    50 to 69   MinBengn  |        58         327    7.592      42.801
Alive     Boston    50 to 69   MaxBengn  |         3         330    0.393      43.194
Alive     Boston    50 to 69   MinMalig  |        18         348    2.356      45.550
Alive     Boston    50 to 69   MaxMalig  |        10         358    1.309      46.859
Alive     Boston    70 & Over  MinBengn  |        26         384    3.403      50.262
Alive     Boston    70 & Over  MaxBengn  |         1         385    0.131      50.393
Alive     Boston    70 & Over  MinMalig  |        15         400    1.963      52.356
Alive     Boston    70 & Over  MaxMalig  |         1         401    0.131      52.487
Alive     Glamorgn  Under 50   MinBengn  |        20         421    2.618      55.105
Alive     Glamorgn  Under 50   MaxBengn  |         1         422    0.131      55.236
Alive     Glamorgn  Under 50   MinMalig  |        16         438    2.094      57.330
Alive     Glamorgn  Under 50   MaxMalig  |         8         446    1.047      58.377
Alive     Glamorgn  50 to 69   MinBengn  |        39         485    5.105      63.482
Alive     Glamorgn  50 to 69   MaxBengn  |         4         489    0.524      64.005
Alive     Glamorgn  50 to 69   MinMalig  |        27         516    3.534      67.539
Alive     Glamorgn  50 to 69   MaxMalig  |        10         526    1.309      68.848
Alive     Glamorgn  70 & Over  MinBengn  |        11         537    1.440      70.288
Alive     Glamorgn  70 & Over  MaxBengn  |         1         538    0.131      70.419
Alive     Glamorgn  70 & Over  MinMalig  |        12         550    1.571      71.990
Alive     Glamorgn  70 & Over  MaxMalig  |         4         554    0.524      72.513
Dead      Tokyo     Under 50   MinBengn  |         7         561    0.916      73.429
Dead      Tokyo     Under 50   MaxBengn  |         3         564    0.393      73.822
Dead      Tokyo     Under 50   MinMalig  |         9         573    1.178      75.000
Dead      Tokyo     Under 50   MaxMalig  |         4         577    0.524      75.524
Dead      Tokyo     50 to 69   MinBengn  |         9         586    1.178      76.702
Dead      Tokyo     50 to 69   MaxBengn  |         2         588    0.262      76.963
Dead      Tokyo     50 to 69   MinMalig  |         9         597    1.178      78.141
Dead      Tokyo     50 to 69   MaxMalig  |        11         608    1.440      79.581
Dead      Tokyo     70 & Over  MinBengn  |         3         611    0.393      79.974
Dead      Tokyo     70 & Over  MinMalig  |         2         613    0.262      80.236
Dead      Tokyo     70 & Over  MaxMalig  |         1         614    0.131      80.366
Dead      Boston    Under 50   MinBengn  |         7         621    0.916      81.283
Dead      Boston    Under 50   MinMalig  |         6         627    0.785      82.068
Dead      Boston    Under 50   MaxMalig  |         6         633    0.785      82.853
Dead      Boston    50 to 69   MinBengn  |        20         653    2.618      85.471
Dead      Boston    50 to 69   MaxBengn  |         2         655    0.262      85.733
Dead      Boston    50 to 69   MinMalig  |         8         663    1.047      86.780
Dead      Boston    50 to 69   MaxMalig  |         3         666    0.393      87.173
Dead      Boston    70 & Over  MinBengn  |        18         684    2.356      89.529
Dead      Boston    70 & Over  MinMalig  |         9         693    1.178      90.707
Dead      Boston    70 & Over  MaxMalig  |         3         696    0.393      91.099
Dead      Glamorgn  Under 50   MinBengn  |         7         703    0.916      92.016
Dead      Glamorgn  Under 50   MinMalig  |        16         719    2.094      94.110
Dead      Glamorgn  Under 50   MaxMalig  |         3         722    0.393      94.503
Dead      Glamorgn  50 to 69   MinBengn  |        12         734    1.571      96.073
Dead      Glamorgn  50 to 69   MinMalig  |        14         748    1.832      97.906
Dead      Glamorgn  50 to 69   MaxMalig  |         3         751    0.393      98.298
Dead      Glamorgn  70 & Over  MinBengn  |         7         758    0.916      99.215
Dead      Glamorgn  70 & Over  MinMalig  |         3         761    0.393      99.607
Dead      Glamorgn  70 & Over  MaxMalig  |         3         764    0.393     100.000

The 35 cells for the women who survived are listed first (the cell for Boston women 
under 50 years old with MaxBengn tumors is empty). In the Cumulative Percent 
column, we see that these women make up 72.5% of the sample. Thus, 27.5% did not 
survive. 

Percentages 

While list layout provides percentages of the total table count, you might want others. 
Here we specify COLPCT in XTAB to print the percentage surviving within each 
age-by-center stratum. 
The input is:


PLENGTH NONE / COLPCT

TABULATE AGE * CENTER$ * SURVIVE$ * TUMOR$
PLENGTH

The output is:


Case Frequencies Determined by Value of Variable NUMBER

Column Percents

AGE     = Under 50
CENTER$ = Tokyo
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   90.667    75.000    74.286    86.207 |  84.768  128.000
Dead   |    9.333    25.000    25.714    13.793 |  15.232   23.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   75.000    12.000    35.000    29.000 |          151.000

AGE     = Under 50
CENTER$ = Boston
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   77.419     0.000    64.706    40.000 |  67.241   39.000
Dead   |   22.581     0.000    35.294    60.000 |  32.759   19.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   31.000     0.000    17.000    10.000 |           58.000

AGE     = Under 50
CENTER$ = Glamorgn
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   74.074   100.000    50.000    72.727 |  63.380   45.000
Dead   |   25.926     0.000    50.000    27.273 |  36.620   26.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   27.000     1.000    32.000    11.000 |           71.000

AGE     = 50 to 69
CENTER$ = Tokyo
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   83.636    71.429    68.966    62.069 |  74.167   89.000
Dead   |   16.364    28.571    31.034    37.931 |  25.833   31.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   55.000     7.000    29.000    29.000 |          120.000

AGE     = 50 to 69
CENTER$ = Boston
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   74.359    60.000    69.231    76.923 |  72.951   89.000
Dead   |   25.641    40.000    30.769    23.077 |  27.049   33.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   78.000     5.000    26.000    13.000 |          122.000

AGE     = 50 to 69
CENTER$ = Glamorgn
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   76.471   100.000    65.854    76.923 |  73.394   80.000
Dead   |   23.529     0.000    34.146    23.077 |  26.606   29.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   51.000     4.000    41.000    13.000 |          109.000

AGE     = 70 & Over
CENTER$ = Tokyo
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   66.667   100.000    33.333    83.333 |  68.421   13.000
Dead   |   33.333     0.000    66.667    16.667 |  31.579    6.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |    9.000     1.000     3.000     6.000 |           19.000

AGE     = 70 & Over
CENTER$ = Boston
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   59.091   100.000    62.500    25.000 |  58.904   43.000
Dead   |   40.909     0.000    37.500    75.000 |  41.096   30.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   44.000     1.000    24.000     4.000 |           73.000

AGE     = 70 & Over
CENTER$ = Glamorgn
SURVIVE$ (rows) by TUMOR$ (columns)

       | MinBengn  MaxBengn  MinMalig  MaxMalig |   Total        N
-------+----------------------------------------+-----------------
Alive  |   61.111   100.000    80.000    57.143 |  68.293   28.000
Dead   |   38.889     0.000    20.000    42.857 |  31.707   13.000
-------+----------------------------------------+-----------------
Total  |  100.000   100.000   100.000   100.000 | 100.000
N      |   18.000     1.000    15.000     7.000 |           41.000


The percentage of women surviving for each age-by-center combination is reported in 
the first row of each panel. In the marginal Total down the right column, we see that 
the younger women treated in Tokyo have the best survival rate (84.77%). This is the 
row total (128) divided by the total for the stratum (151). 
Example 15
Multi Way: Standardize Tables


When you have three or more table factors, Multiway: Standardize tables can be used
to standardize multiway tables and to compare an original total association of two
variables with the same association where the effects of some test factor(s) have been
statistically removed. Your input can be the usual cases-by-variables data file.


The input is: 


USE SURVEY2
XTAB
PLENGTH NONE / FREQ EXPECT STAND DEVI ROWP COLP PERCENT
STD SEX * MARITAL * EDUCATN / SHADE = 3


The last two factors selected (MARITAL and EDUCATN) define the standardized 
two-way tables after removing the effect of test factor SEX. You can observe the 
various tables which are shaded based on the values of standardized residuals and the 
threshold value=3. 
The output is:


Counts after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Percents after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Row Percents after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Column Percents after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Expected Counts after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Deviates after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Standardized Deviates after removing the effect of test Factor(s) SEX
*MARITAL(rows) by *EDUCATN(columns)


Example 16 
Mantel-Haenszel Test 


For any (k x 2 x 2) table, if the output mode is MEDIUM or if you select the
Mantel-Haenszel test, SYSTAT produces the Mantel-Haenszel statistic without continuity
correction. This tests the association between two binary variables controlling for a
stratification variable. The Mantel-Haenszel test is often used to test the effectiveness
of a treatment on an outcome, to test the degree of association between the presence or
absence of a risk factor and the occurrence of a disease, or to compare two survival
distributions.

A study by Ansfield, et al. (1977) examined the responses of two different groups
of patients (colon or rectum cancer and breast cancer) to two different treatments:


CANCER$        TREAT$   RESPONSE$   NUMBER

Colon-Rectum   a        Positive    16.000
Colon-Rectum   b        Positive     7.000
Colon-Rectum   a        Negative    32.000
Colon-Rectum   b        Negative    45.000
Breast         a        Positive    14.000
Breast         b        Positive     9.000
Breast         a        Negative    28.000
Breast         b        Negative    29.000


Here are the data rearranged:


                  Breast Cancer      |     Colon-Rectum
               Positive   Negative   |  Positive   Negative
Treatment A       14         28      |     16         32
Treatment B        9         29      |      7         45


The odds ratio (cross-product ratio) for the first table is:

\[
\frac{\text{odds (biopsy positive, given treatment A)}}{\text{odds (biopsy positive, given treatment B)}}
= \frac{14/28}{9/29} \approx 1.61
\]

Similarly, for the second table, the odds ratio is:

\[
\frac{16/32}{7/45} \approx 3.21
\]

If the odds for treatments A and B are identical, the ratios would both be 1.0. For these
data, the odds of a positive biopsy for breast cancer patients on treatment A are 1.6 times
those for patients on treatment B, while for the colon-rectum patients the odds on
treatment A are 3.2 times those on treatment B. But can you say these estimates differ
significantly from 1.0? After adjusting for the total frequency in each table, the
Mantel-Haenszel statistic combines odds ratios across tables.


The input is: 


USE ANSFIELD
XTAB
FREQUENCY NUMBER
ORDER RESPONSE$ / SORT='Positive', 'Negative'
PLENGTH NONE / MANTEL FREQ
TABULATE CANCER$ * TREAT$ * RESPONSE$


The stratification variable (CANCER$) must be the first variable listed on TABULATE.


The output is:

Case Frequencies Determined by Value of Variable NUMBER


Counts
CANCER$ = Colon-Rectum
TREAT$ (rows) by RESPONSE$ (columns)

      | Positive  Negative   Total
------+---------------------------
a     |       16        32      48
b     |        7        45      52
------+---------------------------
Total |       23        77     100


CANCER$ = Breast
TREAT$ (rows) by RESPONSE$ (columns)

      | Positive  Negative   Total
------+---------------------------
a     |       14        28      42
b     |        9        29      38
------+---------------------------
Total |       23        57      80

Mantel-Haenszel Statistic  : 2.277
Mantel-Haenszel Chi-square : 4.739
p-value                    : 0.029


SYSTAT prints a chi-square test for testing whether this combined estimate equals 1.0
(that odds for A and B are the same). The probability associated with this chi-square is
0.029, so you reject the hypothesis that the odds ratio is 1.0 and conclude that treatment
A is less effective: more patients on treatment A have positive biopsies after treatment
than patients on treatment B.
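
The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SYSTAT's implementation: the function names are hypothetical, the pooled estimate is taken to be the usual Mantel-Haenszel common odds ratio, and the chi-square uses the standard hypergeometric expectation and variance of the first cell in each stratum. With these formulas, the printed chi-square of 4.739 is reproduced when a 0.5 continuity correction is applied; the uncorrected value is about 5.51.

```python
# Each stratum is a 2 x 2 table (a, b, c, d):
#   a = treatment A positive, b = treatment A negative,
#   c = treatment B positive, d = treatment B negative.
strata = {
    "Colon-Rectum": (16, 32, 7, 45),
    "Breast": (14, 28, 9, 29),
}


def odds_ratio(a, b, c, d):
    """Cross-product ratio (a/b) / (c/d) for one 2 x 2 table."""
    return (a * d) / (b * c)


def mantel_haenszel(tables, continuity=False):
    """Return (pooled odds ratio, chi-square) across 2 x 2 strata."""
    num = den = observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n                     # numerator of the pooled odds ratio
        den += b * c / n                     # denominator of the pooled odds ratio
        observed += a
        expected += (a + b) * (a + c) / n    # hypergeometric mean of cell a
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    diff = abs(observed - expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)          # optional continuity correction
    return num / den, diff * diff / variance


per_stratum = {name: odds_ratio(*t) for name, t in strata.items()}
pooled, chi2_plain = mantel_haenszel(strata.values())
_, chi2_corrected = mantel_haenszel(strata.values(), continuity=True)

print(per_stratum)                # stratum odds ratios, about 3.21 and 1.61
print(round(pooled, 3))           # pooled estimate
print(round(chi2_plain, 3), round(chi2_corrected, 3))
```

The pooled estimate lies between the two stratum odds ratios, which is what one expects from a weighted combination.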

The Mantel-Haenszel chi-square test assumes that the odds ratios are homogeneous
across tables. For your example, the second odds ratio is twice as large as the first.
You can use loglinear models to test if a cancer-by-treatment interaction is needed to
fit the cells of the three-way table defined by cancer, treatment, and response. The
difference between this model and one without the interaction was not significant
(p-value = 0.029).
Descriptive Statistics


Leland Wilkinson and Laszlo Engelman 
(revised by Ravindra Jore) 


There are many ways to describe data, although not all descriptors are appropriate for 
a given sample. Means and standard deviations are useful for data that follow a 
normal distribution, but are poor descriptors when the distribution is highly skewed 
or has outliers, subgroups, or other anomalies. Some statistics, such as the mean and 
median, describe the center of a distribution. These estimates are called measures of 
location. Others, such as the standard deviation, describe the spread of the 
distribution. 

Before deciding what you want to describe (location, spread, and so on), you 
should consider what type of variables are present. Are the values of a variable 
unordered categories, ordered categories, counts, or measurements? 

For many statistical purposes, counts are treated as measured variables. Such 
variables are called quantitative if it makes sense to do arithmetic on their values. 

Means and standard deviations are appropriate for quantitative variables that
follow a normal distribution. Often, however, real data do not meet this assumption
of normality. A descriptive statistic is called robust if the calculations are insensitive
to violations of the assumption of normality. Robust measures include the median,
quartiles, frequency counts, and percentages.

If you do not want extreme observations (outliers) to exert too much influence
on your descriptors, you can trim the data: a specified proportion of the data at one
or both extremes is left out when computing the descriptors, hence the term
trimmed mean. The trimmed mean is less efficient than the mean for normally
distributed data, but if the distribution is skewed it is less sensitive to sampling
fluctuations.

Before requesting descriptive statistics, first scan graphical displays to see if the
shape of the distribution is symmetric, if there are outliers, and if the sample has
subpopulations. If the latter is true, then the sample is not homogeneous, and the
statistics should be calculated for each subgroup separately.

Generally, data are presented in a rectangular format with columns representing 
variables and rows representing cases. Almost always, descriptive statistics are needed 
for the variables and such statistics are called column statistics. Occasionally, 
descriptive statistics are needed for cases or rows. For instance, if your data set consists 
of scores in a number of similar tests (columns) on a list of students (cases) and if you 
wish to find the average score and the variation of each student, you would want row 
statistics. 

Descriptive Statistics offers basic statistics and stem-and-leaf plots for columns as 
well as rows. The basic statistics are number of observations (N), minimum, maximum, 
arithmetic mean (AM), geometric mean, harmonic mean, sum, standard deviation, 
variance, coefficient of variation (CV), range, median, standard error of AM, etc. 

Besides the above descriptors, the trimmed mean can also be computed for columns 
and rows. Here the user specifies whether left-sided (lower), right-sided (upper), or 
two-sided trimming is required and specifies the proportion p of data to be removed. 
In two-sided trimming, a proportion p of data is removed from each side. 
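
As an illustration of the three trimming options, here is a minimal Python sketch. The function name, the `side` labels, and the rule of dropping int(p * n) cases per trimmed end are assumptions made for this example; SYSTAT's own rounding rule is not documented here and may differ.

```python
def trimmed_mean(values, p=0.10, side="two"):
    """Mean after dropping a proportion p of the sorted data.

    side: "lower", "upper", or "two" (p is removed from each end).
    The number of dropped cases per end is int(p * n), an assumption of this sketch.
    """
    x = sorted(values)
    k = int(p * len(x))
    lo = k if side in ("lower", "two") else 0
    hi = len(x) - k if side in ("upper", "two") else len(x)
    kept = x[lo:hi]
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)


scores = [2, 4, 5, 5, 6, 6, 7, 7, 8, 90]     # hypothetical data with one outlier

print(sum(scores) / len(scores))             # ordinary mean, pulled up by 90
print(trimmed_mean(scores))                  # two-sided, 10% from each end
print(trimmed_mean(scores, side="upper"))    # only the largest 10% removed
```

With the outlier trimmed away, the two-sided trimmed mean sits close to the bulk of the scores, while the ordinary mean does not.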

A confidence interval for the mean (based on the normal distribution, with a default 
confidence coefficient of 0.95, which can be changed by the user) and skewness and 
kurtosis measures with their standard errors (SES, SEK) can also be opted for. Along 
with all the above options, Shapiro-Wilk and Anderson-Darling tests for normality can 
also be performed. For multivariate data, Mardia's skewness and kurtosis coefficients 
and asymptotic tests of significance on them, and the Henze-Zirkler test are available. 
N-tiles and P-tiles are also available with seven different algorithms and an associated 
transformation of the data to an N-tile class can be requested. 

A stem-and-leaf plot is available for assessing distributional shape and identifying
outliers. Moreover, Descriptive Statistics provides stratified analyses; that is, you can
request results separately for each level of a grouping variable (such as SEX$) or for each
combination of levels of two or more grouping variables.

Resampling procedures are available with this feature. 

Under Basic Statistics, if you choose any of the resampling options, then SYSTAT 
gives a summarization based on resampling. You can opt for the following: mean, 
median, variance, standard deviation, skewness, and kurtosis. You can get resampling 
estimates along with their bias and standard error. Under bootstrap, you will also get 
confidence intervals for the corresponding parameters using two popular methods, 
viz., Percentile method and Bias corrected and accelerated method. 
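
To make the resampling summary concrete, the following Python sketch draws bootstrap samples, then reports the estimate, its bootstrap bias and standard error, and a percentile confidence interval. It is a simplified illustration, not SYSTAT's algorithm: the function name, its defaults (1000 samples, seed 1), and the index rule used for the percentile limits are assumptions, and the bias-corrected and accelerated interval is not attempted.

```python
import random
import statistics


def bootstrap(values, stat=statistics.mean, n_samples=1000, confidence=0.95, seed=1):
    """Bootstrap estimate, bias, standard error, and percentile interval for stat."""
    rng = random.Random(seed)
    n = len(values)
    estimate = stat(values)
    # Resample n cases with replacement and recompute the statistic each time.
    reps = sorted(stat(rng.choices(values, k=n)) for _ in range(n_samples))
    alpha = 1.0 - confidence
    lower = reps[int((alpha / 2) * (n_samples - 1))]
    upper = reps[int(round((1 - alpha / 2) * (n_samples - 1)))]
    return {
        "estimate": estimate,
        "bias": statistics.mean(reps) - estimate,
        "se": statistics.stdev(reps),
        "ci": (lower, upper),
    }


ages = [34, 41, 45, 49, 52, 56, 58, 63, 66, 72, 81]   # hypothetical sample
result = bootstrap(ages)
print(result)
```

Fixing the seed makes the run repeatable, which mirrors the role of the random seed option described for the Resampling tab.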
Statistical Background


Descriptive statistics are numerical summaries of batches of numbers. Inevitably, these 
summaries are misleading, because they mask details of the data. Without them, 
however, we would be lost in particulars. 

There are many ways to describe a batch of data. Not all are appropriate for every 
batch, however. Let us look at the Who's Who data from Chapter 1 to see what this 
means. First of all, here is a stem-and-leaf diagram of the ages of 50 randomly sampled 
people from Who's Who. A stem-and-leaf diagram is a tally; it shows us the 
distribution of the AGE values. 


Stem and leaf plot of variable: AGE, N = 50 


Minimum: 34.000 
Lower hinge: 49.000 
Median: 56.000 
Upper hinge: 66.000 
Maximum: 81.000 
Notice that these data look fairly symmetric and lumpy in the middle. A natural way to
describe this type of distribution would be to report its center and the amount of spread.


Location


How do we describe the center, or central location of the distribution, on a scale? One
way is to pick the value above which half of the data values fall and, by implication,
below which the other half of the data values fall. This measure is called the median.
For our AGE data, the median age is 56 years.

When there are extreme values or outliers present in the data, the arithmetic mean
(AM) will be affected by the extreme observations and thus will not be a suitable
measure of central tendency. The median is computed based only on the central one or
two values and does not depend on the values of other observations. A trimmed mean
is a compromise between AM and the median. However, it is not an easy task to decide
on a suitable trimming proportion.
The geometric mean (GM) is a suitable measure of central tendency when the
quantities involved are multiplicative in nature, such as rate of population growth,
interest rate, etc. For example, suppose an investment earns an interest of 5% in the
first year, 15% in the second, and 25% in the third. Then the investor may be interested
in the 'average' annual interest percentage. Evidently, we want the answer to be a
number y such that, if the annual interest rate y applies uniformly over the three years,
the final return is the same as that given by the differential interest rates
mentioned. Thus the average y we are seeking is such that 1.05 * 1.15 * 1.25 =
1.50937500 = (1+y)^3, making (1+y) = (1.05 * 1.15 * 1.25)^(1/3), which is the
geometric mean of the three positive numbers 1.05, 1.15, 1.25, namely
1.147094113. Notice that (1.147094113)^3 = 1.509375001. So the value of y is
0.147094113 or 14.7094113%. The GM of n positive numbers is defined to be the nth
root of the product of the numbers. GM can also be defined and computed for weighted
variables. Let z_1, z_2, ..., z_n be n positive numbers with positive weights w_1, w_2, ...,
w_n respectively; then GM is

\[
GM = \left( \prod_{i=1}^{n} z_i^{w_i} \right)^{1 / \sum_{i=1}^{n} w_i}
\]
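
The interest-rate example can be checked with a short Python sketch. The function name is hypothetical; the computation works on logarithms, which is equivalent to taking the root of the (weighted) product and avoids overflow for long series.

```python
import math


def geometric_mean(values, weights=None):
    """Weighted geometric mean of positive numbers (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(values)
    if any(v <= 0 for v in values) or any(w <= 0 for w in weights):
        raise ValueError("values and weights must be positive")
    log_sum = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(log_sum / sum(weights))


growth = [1.05, 1.15, 1.25]                  # yearly growth factors from the example
average_rate = geometric_mean(growth) - 1.0

print(round(geometric_mean(growth), 9))      # about 1.147094113
print(round(100 * average_rate, 4))          # average annual interest in percent
```

Applying the average rate three times reproduces the compound return of 1.509375, which is the defining property used in the text.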


The harmonic mean (HM) is a suitable measure of central tendency when the
quantities involved are rates. For example, suppose a person drove a car for 100
kilometers (km), maintaining a speed of 50 km/hr for the first 25 km, 40 km/hr for the
next 25 km, 45 km/hr for the next 25 km, and 55 km/hr for the last 25 km. He has then
spent 25(1/50 + 1/40 + 1/45 + 1/55) hours on a 100 km journey, making the average speed
4 / (1/50 + 1/40 + 1/45 + 1/55) = 46.83619 km/hr, which is the harmonic mean of the four
speeds. The harmonic mean of n numbers is the reciprocal of the arithmetic mean of
the reciprocals of these n numbers.

HM can be defined for the weighted variable as

\[
HM = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i / z_i}
\]
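
The driving example can be verified with the following Python sketch (the function name is hypothetical). The weights play the role of the distances covered at each speed; with equal distances they cancel, and the result is the ordinary harmonic mean.

```python
def harmonic_mean(values, weights=None):
    """Weighted harmonic mean of positive numbers (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


speeds = [50, 40, 45, 55]                    # km/hr over four equal 25 km legs
average_speed = harmonic_mean(speeds)

print(round(average_speed, 5))                           # about 46.83619
print(round(harmonic_mean(speeds, [25, 25, 25, 25]), 5)) # same, weighting by distance
```

If the legs had different lengths, passing the distances as weights would give total distance divided by total time, which is the average speed for the whole journey.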


Spread 


One way to measure spread is to take the difference between the largest and smallest
value in the data. This is called the range. For the age data, the range is 47 years.
Another measure, called the interquartile range or midrange, is the difference between
the values at the limits of the middle 50% of the data. For AGE, this is 17 years. (Using
the statistics at the top of the stem-and-leaf display, subtract the lower hinge from the
upper hinge.) Still another way to measure spread would be to compute the average
variability in the values. The standard deviation is the square root of the average squared
deviation of values from the mean. For the AGE variable, the standard deviation is
11.62. Some output for the Who's Who data is:


N of Cases          |     50
Arithmetic Mean     | 56.700
Standard Deviation  | 11.620
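

The three spread measures can be computed together, as in this Python sketch. The data are hypothetical (the 50 Who's Who ages are not listed here), the function names are illustrative, and the hinges follow Tukey's convention of taking the medians of the lower and upper halves, with the middle value shared when n is odd. The standard deviation uses the n - 1 divisor.

```python
import statistics


def hinges(values):
    """Tukey hinges: medians of the lower and upper halves of the sorted data."""
    x = sorted(values)
    half = (len(x) + 1) // 2          # the middle value is shared when n is odd
    return statistics.median(x[:half]), statistics.median(x[-half:])


def spread(values):
    """Range, hinge spread (interquartile range), and standard deviation."""
    lower, upper = hinges(values)
    return {
        "range": max(values) - min(values),
        "hinge_spread": upper - lower,
        "sd": statistics.stdev(values),
    }


ages = [34, 41, 45, 49, 52, 56, 58, 63, 66, 72, 81]   # hypothetical ages
print(hinges(ages))
print(spread(ages))
```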


The Normal Distribution 


All of these measures of location and spread have their advantages and disadvantages, 
but the mean and standard deviation are especially useful for describing data that 
follow a normal distribution. The normal distribution is a mathematical curve with 
only two parameters in its equation: the mean and standard deviation. As you recall 
from Chapter 1, a parameter defines a family of mathematical functions, all of which 
have the same general shape. Thus, if data come from a normal distribution, we can 
describe them completely (except for random variation) with only a mean and standard 
deviation. 

Let us see how this works for our AGE data. Shown in the next figure is a histogram 
of AGE with the normal curve superimposed. The location (center) of this curve is at 
the mean age of the sample (56.7), and its spread is determined by the standard 
deviation (11.62). 
The fit of the curve to the data looks excellent. Let us examine the fit in more detail.
For a normal distribution, we would expect 68% of the observations to fall between
one standard deviation below the mean and one standard deviation above the mean
(45.1 to 68.3 years). By counting values in the stem-and-leaf diagram, we find 34
cases, which is on target (68% of 50). This is not to say that every number follows a
normal distribution exactly, however. If we looked further, we would find that the tails
of this distribution are slightly shorter than those from a normal distribution, but not
enough to worry.
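
The expected count can be reproduced from the normal distribution function, as in this small Python sketch. The helper name is hypothetical; the mean, standard deviation, and sample size are the values quoted for the AGE data.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


n, mean, sd = 50, 56.7, 11.62
lower, upper = mean - sd, mean + sd
share = normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)
expected_cases = n * share

print(round(lower, 1), round(upper, 1))   # limits of the one-SD band
print(round(share, 4))                    # proportion expected inside the band
print(round(expected_cases, 1))           # expected number of cases out of 50
```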


Test for Normality 


A more formal way of finding out if the normal distribution describes the data well is
to carry out a statistical test of hypothesis. The Shapiro-Wilk test (Shapiro and
Wilk, 1965) is a standard test for normality used when the sample size is between 3 and
5000. The p-value given by this test is an indication of how good the fit is: the smaller
the p-value is, the worse is the fit. Generally, p-values of the order of 0.05 or 0.01 are
considered small enough to declare the fit poor. For a set of observations x_1, x_2, ..., x_n
sorted in either ascending or descending order as x_(1), x_(2), ..., x_(n),
Shapiro-Wilk's W statistic is given by:

\[
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

where a = (a_1, a_2, ..., a_n) = m'V^{-1} / (m'V^{-1}V^{-1}m)^{0.5}, m denotes a vector of
expected values of standard normal order statistics, and V denotes the corresponding
covariance matrix.


For the AGE data above, the following are the results of the Shapiro-Wilk normality
test:

                        |   AGE
------------------------+------
Shapiro-Wilk Statistic  | 0.980
Shapiro-Wilk p-value    | 0.532


Since the p-value is very high, the normal distribution is considered quite suitable to
describe the age data. This agrees with the informal assessment made above using the
histogram.


The Anderson-Darling test (Anderson and Darling, 1952, 1954) is a standard
goodness-of-fit test. It is designed to test whether the given data arise from a given
distribution. The test statistic is given as:

\[
A_n^2 = n \int_{-\infty}^{\infty} \frac{[F_n(x) - F(x)]^2}{F(x)\,[1 - F(x)]} \, dF(x)
\]

where F(x) is the hypothesized distribution function and F_n(x) is the proportion of
sample points less than or equal to x in a sample of size n. It gives greater importance
to the observations in the tails than to those at the center.
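
In practice the integral is evaluated through the equivalent sum over the ordered sample, A² = -n - (1/n) Σ (2i - 1)[ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]. The Python sketch below applies that sum to a normal distribution whose mean and standard deviation are estimated from the data. The function names and both data sets are hypothetical, no adjustment or p-value is computed, and the result should not be read as SYSTAT's output.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling_normal(values):
    """A-squared for a normal fit with mean and SD estimated from the sample."""
    x = sorted(values)
    n = len(x)
    mean = statistics.mean(x)
    sd = statistics.stdev(x)
    u = [normal_cdf((v - mean) / sd) for v in x]
    # i runs from 0, so (2 * i + 1) is the weight 2i - 1 of the 1-based formula.
    total = sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


near_normal = [-1.5, -1.0, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 1.0, 1.5]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20, 60]
a2_near_normal = anderson_darling_normal(near_normal)
a2_skewed = anderson_darling_normal(skewed)

print(round(a2_near_normal, 3), round(a2_skewed, 3))
```

The skewed sample gives the larger statistic because its extreme right tail departs most from the fitted normal curve, and the tails carry the heaviest weight.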


Multivariate Normality Assessment 


Mardia's skewness and kurtosis coefficients (Mardia, 1970), and tests of significance 
of these coefficients using asymptotic distributions, are useful for multivariate 
normality assessment. Also, one may use the Henze-Zirkler test statistic (Henze and 
Zirkler, 1990) and its associated p-value using lognormal distribution. 


Non-Normal Shape 


Before you compute means and standard deviations on everything in sight, however,
let us take a look at some more data: the USDATA data. The following are histograms
for the first two variables, ACCIDENT and CARDIO:
Notice that the normal curves fit the distributions poorly. ACCIDENT is positively
skewed. That is, it has a long right tail. CARDIO, on the other hand, is negatively
skewed. It has a long left tail. The means (44.3 and 398.5) clearly do not fall in the
centers of the distributions. Furthermore, if you calculate the medians using the Stem
display, you will see that the mean for ACCIDENT is pulled away from the median
(41.9) toward the upper tail and the mean for CARDIO is pulled to the left of the
median (416.2). The poor fit of the normal distributions is also borne out by the
following results of the Shapiro-Wilk test:


                        | ACCIDENT   CARDIO
------------------------+------------------
Shapiro-Wilk Statistic  |    0.906    0.913
Shapiro-Wilk p-value    |    0.001    0.001

In short, means and standard deviations are not good descriptors for non-normal data.
In these cases, you have two alternatives: either transform your data to look normal, or
find other descriptive statistics that characterize the data. If you log the values of
ACCIDENT, for example, the histogram looks quite normal. If you square the values
of CARDIO, the normal fit similarly improves.
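
The effect of a log transformation on a long right tail can be seen numerically. In this Python sketch the rates are hypothetical values chosen to be positively skewed (they are not the ACCIDENT data), and the skewness is the simple moment-based coefficient, which may differ from the formula SYSTAT uses.

```python
import math


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (one of several common definitions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rates = [20, 22, 25, 27, 30, 33, 38, 45, 60, 95]   # hypothetical, long right tail
logged = [math.log10(r) for r in rates]

print(round(skewness(rates), 2))    # clearly positive
print(round(skewness(logged), 2))   # closer to zero after the log transform
```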

If a transformation does not work, then you may be looking at data that come from
a different mathematical distribution or are mixtures of subpopulations (see below).
The probability plots in SYSTAT can help you identify certain mathematical
distributions. There is not room here to discuss parameters for more complex
probability distributions. Otherwise, you should turn to distribution-free summary
statistics to characterize your data: the median, range, minimum, maximum, midrange,
quartiles, and percentiles.
Subpopulations


Sometimes, distributions can look non-normal because they are mixtures of different
normal distributions. Let us look at the Fisher/Anderson IRIS flower measurements.
The following is a histogram of PETALLEN (petal length) smoothed by a normal
curve:


[Figure: histogram of PETALLEN with a normal curve superimposed]


We forgot to notice that the petal length measurements involve three different flower
species. You can see one of them at the left. The other two are blended at the right.
Computing a mean and standard deviation on the mixed data is misleading.


The following box plot, split by species, shows how different the subpopulations are:


[Figure: box plots of PETALLEN by SPECIES]
When there are such differences, you should compute basic statistics by group. If you
want to go on to test whether the differences in subpopulation means are significant,
use analysis of variance.

But first notice that the Setosa flowers (Group 1) have the shortest petals and the
smallest spread, while the Virginica flowers (Group 3) have the longest petals and
widest spread. That is, the size of the cell mean is related to the size of the cell standard
deviation. This violates the assumption of equal variances necessary for a valid
analysis of variance.

Here, we log transform the plot scale:


[Figure: box plots of PETALLEN by SPECIES on a log scale]


The spreads of the three distributions are now more similar. For the analysis, we should
log transform the data.
Descriptive Statistics in SYSTAT


Basic Statistics Dialog Box
To open the Basic Statistics dialog box, from the menus choose:


Analyze
  Basic Statistics...


[Basic Statistics dialog box, Main tab: Available and Selected variable(s) lists, checkboxes for the statistics described below, and a Save statistics option]


The following statistics are available:

■ All options. Calculate all available statistics.

■ N. The number of non-missing values for the variable.

■ Minimum. The smallest non-missing value.

■ Maximum. The largest non-missing value.

■ Sum. The total of all non-missing values of a variable.

■ Arithmetic mean (AM). The arithmetic mean of a variable: the sum of the values
  divided by the number of (non-missing) values.

■ SE of AM. The standard error of the mean is the standard deviation divided by the
  square root of the sample size. It is the estimation error, or the average deviation of
  sample means from the expected value of a variable.

■ CI of AM. Endpoints for the confidence interval of the mean. You can specify a
  confidence level for the confidence interval of the mean. Enter a value between 0
  and 1 (0.95, the default, and 0.99 are typical values). If the value is bigger than 1, it is
  treated as a percentage.

■ Geometric mean. Computes the geometric mean for positive values. It is the nth
  root of the product of all n non-missing entries.

■ Harmonic mean. Calculates the harmonic mean for positive values. It is the
  number of elements to be averaged divided by the sum of the reciprocals of the
  elements.

■ Trimmed mean. Calculates the mean after trimming out the extreme observations.
  For two-sided trimming (the default), enter a value between 0 and 0.5; for lower or
  upper trimming, enter a value between 0 and 1. The default value in all cases
  is 0.10. Be aware that with two-sided trimming, each side is trimmed by the given
  proportion.

■ Median. The median estimates the center of a distribution. If the data are sorted in
  increasing order, the median is the value above which half of the values fall.

■ SD. Standard deviation, a measure of spread, is the square root of the sum of the
  squared deviations of the values from the mean divided by (n-1).

■ CV. The coefficient of variation is the standard deviation divided by the sample
  mean.

■ Range. The difference between the maximum and the minimum values.

■ Variance. The sum of the squared deviations of values from the mean divided by
  (n-1). (Variance is the standard deviation squared.)

■ Skewness. A measure of the symmetry of a distribution about its mean. If the
  skewness is significantly nonzero, the distribution is asymmetric. A significant
  positive value indicates a long right tail; a negative value, a long left tail. A
  skewness coefficient is considered significant if the absolute value of SKEWNESS
  / SES is greater than 2.

■ SE of skewness. The standard error of skewness (SQR(6/n)).

■ Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable
  has longer tails than those for a normal distribution; less than 0 indicates that the
  distribution is flatter than a normal distribution. A kurtosis coefficient is considered
  significant if the absolute value of KURTOSIS / SEK is greater than 2.

■ SE of kurtosis. The standard error of kurtosis (SQR(24/n)).

■ Shapiro-Wilk normality test. Computes the Shapiro-Wilk test statistic along with
  its p-value.

■ Anderson-Darling normality test. Computes the Anderson-Darling test statistic
  along with its p-value.

■ Multivariate normality assessment. The following measures and tests of
  multivariate normality are available:

  ■ Mardia skewness. Computes Mardia's skewness coefficient and tests its
    significance using an asymptotic distribution.

  ■ Mardia kurtosis. Computes Mardia's kurtosis coefficient and tests its
    significance using an asymptotic distribution.

  ■ Henze-Zirkler test. Computes the Henze-Zirkler test statistic and its associated
    p-value using lognormal distribution.
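
Several of these definitions can be tied together in a short Python sketch. The function name and the data are hypothetical. The standard errors use the SQR(6/n) and SQR(24/n) expressions given above, while the skewness and kurtosis coefficients are the simple moment-based versions, an assumption of this example, since the exact formulas SYSTAT uses are not stated here.

```python
import math


def describe(values):
    """Basic statistics with the |coefficient / SE| > 2 rule for skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    sd = math.sqrt(ss / (n - 1))                 # SD with the (n-1) divisor
    m2 = ss / n
    skew = (sum((v - mean) ** 3 for v in values) / n) / m2 ** 1.5
    kurt = (sum((v - mean) ** 4 for v in values) / n) / m2 ** 2 - 3.0
    ses = math.sqrt(6.0 / n)                     # SE of skewness
    sek = math.sqrt(24.0 / n)                    # SE of kurtosis
    return {
        "N": n,
        "AM": mean,
        "SD": sd,
        "SE": sd / math.sqrt(n),                 # SE of AM
        "CV": sd / mean,
        "skewness": skew,
        "SES": ses,
        "skew_significant": abs(skew / ses) > 2,
        "kurtosis": kurt,
        "SEK": sek,
        "kurt_significant": abs(kurt / sek) > 2,
    }


d = describe([2, 4, 4, 4, 5, 5, 7, 9])           # hypothetical values
for name, value in d.items():
    print(name, value)
```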


N- & P-Tiles 


The median divides the data into two equal groups and hence the median could be
called the 2-tile value; for the same reason, it could also be called the 50th percentile
(50th P-tile). More descriptive information about the data is given by those numbers
that divide the data into N equal parts, called N-tiles, and by P-tiles for various values
of P from 0 to 100. There are many methods for computing these statistics, depending upon
the assumptions about the nature of the distribution.

SYSTAT computes N-tiles and P-tiles by seven different methods. SYSTAT also
offers a transformation which, depending on the choice of N, classifies a given number
into one of the N N-tile classes. This transformation can be applied to an input
number or to an entire column of variate values. For example, if the number of groups
is 5, the transformation returns a '1' for the observations or numbers in the lowest fifth,
a '2' for the observations or numbers in the second lowest fifth, and so on.


To request N-tiles, P-tiles, and Classification (transformation) for a column, click the
N- & P-Tiles tab in the Basic Statistics dialog box.
[Basic Statistics dialog box, N- & P-Tiles tab: N-tiles field, Percentiles (1 5 10 20 25 30 40 50 60 70 75 80 90 95 99), Method checkboxes, and Classify with Available and Selected variable(s) lists]


N-tiles. Values that divide a sample of data into N groups containing (as far as possible) 
equal numbers of observations. The output gives the N-1 intermediate points. 


Percentiles. Values that divide a sample of data into one hundred groups containing (as 
far as possible) equal numbers of observations. 


Method. Let n represent the number of non-missing values for the selected variable,
and let x(1), x(2), ..., x(n) represent its ordered values, with x(0) = x(1) and
x(n+1) = x(n). Let P denote the pth percentile. Write:

L(n,p) = I + F

P = W1 x(I) + W2 x(I+1) + W3 x(I+2)

where I is the integer part of L(n,p) and F represents the fractional part of L(n,p).
Different methods use different expressions for L(n,p) and weights W1, W2, and W3.
The following methods are available:


All. Calculates N-tiles and P-tiles using all seven methods.


Cleveland. It is the default method; it uses the following:
L(n,p) = (np/100) + 0.5, W1 = 1-F, W2 = F, and W3 = 0


Weighted average 1. Calculates the weighted average at x(np/100). This method uses
the following:
L(n,p) = np/100, W1 = 1-F, W2 = F, and W3 = 0


Closest. Calculates the observation numbered closest to (np/100) and uses the
following:
L(n,p) = (np/100) + 0.5, W1 = 1, W2 = 0, and W3 = 0


Empirical CDF. This method uses the empirical distribution function. For this:
L(n,p) = np/100, W1 = 1-d(F), W2 = d(F), and W3 = 0,
where d(F) = 0 if F = 0 and d(F) = 1 if F > 0.


Weighted average 2. Calculates the weighted average aimed at the observation closest
to x((n+1)p/100). For this:
L(n,p) = (n+1)p/100, W1 = 1-F, W2 = F, and W3 = 0


Empirical CDF (average). Calculates the empirical distribution function with
averaging. For this:
L(n,p) = np/100, W1 = (1-d(F))/2, W2 = (1+d(F))/2, and W3 = 0


Weighted average 3. Calculates the weighted average aimed at the observation closest
to x((n-1)p/100 + 1). For this:
L(n,p) = (n-1)p/100, W1 = 0, W2 = 1-F, and W3 = F

Classify. The following options (one or both) can be specified:


Selected variable(s). Selects the variable to be transformed based on the
computation requested.


Value(s). Specify the value(s) you want to classify. Type entries separated by
spaces; separate the value(s) by a semicolon if N-tiles are requested for more than
one variable.


The output appears in the Output Pane in both cases.
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Saving Basic Statistics to a File 


If you are saving statistics to a file, you must select the format in which the statistics 
are to be saved: 


■ Variables. Use with a By Groups variable to save selected statistics to a data file.
Each selected statistic is a case in the new data file (both the statistic and the
group(s) are identified). The file contains the variable STATISTICS$ identifying the
statistics.


■ Aggregate. Saves aggregate statistics to a data file. For each By Groups category,
a record (case) in the new data file contains all requested statistics. Three characters
are appended to the first eight letters of the variable name to identify the statistics.
The first two characters identify the statistic. The third character represents the
order in which the variables are selected. The statistics correspond to the following
two-letter combinations:


N of cases         NU      Kurtosis                             KU
Minimum            MI      SE Kurtosis                          EK
Maximum            MA      Shapiro-Wilk statistic               WS
Sum                SU      Shapiro-Wilk p-value                 WP
Arithmetic Mean    ME      Anderson-Darling statistic           AD
CI Upper           CU      Adjusted Anderson-Darling statistic  AA
CI Lower           CL      Anderson-Darling p-value             AP
Geometric mean     GM      Mardia's skewness coefficient        M1
Harmonic mean      HM      Mardia's skewness based statistic    M2
Trimmed mean       TM      Mardia's skewness p-value            M3
Median             MD      Mardia's kurtosis coefficient        M4
Std. Error         SE      Mardia's kurtosis based statistic    M5
Std. Deviation     SD      Mardia's kurtosis p-value            M6
Variance           VA      Henze-Zirkler statistic              HS
C.V.               CV      Henze-Zirkler p-value                HP
Range              RA      N-tile                               N1 N2 N3
Skewness           SK      Percentile                           P1 P2 P3
SE Skewness        ES


The saving option for N-tiles and P-tiles saves the results of the first method among the
selected methods.


Resampling 


Click the Resampling tab to specify different resampling options. 


[Dialog box: Analyze: Basic Statistics, Resampling tab]


Perform resampling. Generates samples of cases and uses data thereof to carry out the
same analysis on each sample.


Method. Three sampling methods are available:

■ Bootstrap. Generates bootstrap samples. This is the default method.

■ Without replacement. Generates subsamples without replacement.

■ Jackknife. Generates jackknife samples.


Number of samples. Specify the number of samples to be generated. These samples are 
analyzed using the chosen method of sampling. The default is 1. 


Sample size. Specify the size of each sample to be generated while resampling. The 
default sample size is the number of cases in the data file in use. 


Random seed. Specify a random seed to be used while resampling. The default random 
seed is generated by the system. 


Confidence. Specify a confidence level for bootstrap-based confidence interval. Enter 
any value between 0 and 1. The default value is 0.95. 


Estimates. Specify the parameters for which you desire resampling estimates. 
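
The three sampling methods and the confidence option can be illustrated with a short Python sketch. It is not SYSTAT's algorithm: the function names are hypothetical, and the percentile-style interval is one common choice assumed here, since the text does not say how the bootstrap-based interval is formed.

```python
import random
import statistics

def bootstrap_samples(data, m, n, rng):
    """m samples of size n, drawn with replacement."""
    return [rng.choices(data, k=n) for _ in range(m)]

def subsamples_without_replacement(data, m, n, rng):
    """m subsamples of size n, each drawn without replacement (n <= len(data))."""
    return [rng.sample(data, n) for _ in range(m)]

def jackknife_samples(data):
    """len(data) samples, each leaving out exactly one case."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]

def percentile_interval(estimates, confidence=0.95):
    """Assumed interval rule: cut equal tails off the sorted resampling estimates."""
    ordered = sorted(estimates)
    cut = int((1 - confidence) / 2 * (len(ordered) - 1))
    return ordered[cut], ordered[len(ordered) - 1 - cut]

# Hypothetical data, used only to exercise the functions.
rng = random.Random(12345)            # a fixed random seed makes the run repeatable
data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9, 12.5]
boot_means = [statistics.mean(s) for s in bootstrap_samples(data, 200, len(data), rng)]
low, high = percentile_interval(boot_means, 0.95)
```

Fixing the seed plays the role of the Random seed option: the same seed reproduces the same samples.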


Stem-and-Leaf Plot Dialog Box 


To open the Stem-and-Leaf Plot dialog box, from the menus choose: 


Analyze 
Stem-and-Leaf... 


[Dialog box: Analyze: Stem-and-Leaf]


Stem creates a stem-and-leaf plot for one or more variables. The plot shows the
distribution of a variable graphically. In a stem-and-leaf plot, the digits of each number
are separated into a stem and a leaf. The stems are listed as a column on the left, and 
the leaves for each stem are in a row on the right. Stem-and-leaf plots also list the 
minimum, lower-hinge, median, upper-hinge, and maximum values of the sample. 
Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of 
the leaves. 

The stem-and-leaf plot is useful for assessing distributional shape and identifying 
outliers. Values that are markedly different from the others in the sample are labeled as 
outside values—that is, the value is more than 1.5 hspreads outside its hinge (the 
hspread is the distance between the lower and upper hinges, or quartiles). Under 
normality, this translates into roughly 2.7 standard deviations from the mean. 
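
The figure of roughly 2.7 standard deviations can be checked with a few lines of Python. The sketch assumes a standard normal distribution and treats the hinges as the quartiles, as the paragraph above does.

```python
from statistics import NormalDist

std_normal = NormalDist()                    # mean 0, standard deviation 1
upper_hinge = std_normal.inv_cdf(0.75)       # upper quartile, about 0.674
lower_hinge = -upper_hinge                   # lower quartile, by symmetry
hspread = upper_hinge - lower_hinge          # about 1.349 standard deviations
upper_fence = upper_hinge + 1.5 * hspread    # distance from the mean, in SD units
```

The fence sits about 2.7 standard deviations above the mean, and the lower fence is the mirror image.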


The following must be specified to obtain a stem-and-leaf plot:

■ Selected variable(s). A separate stem-and-leaf plot is created for each selected
variable.

■ Number of lines. You can indicate how many lines (stems) to include in the plot.


Basic Statistics for Rows


To open the Row Statistics dialog box, from the menus choose: 


Analyze 
Row Statistics 
Basic Statistics... 


[Dialog box: Analyze: Row Statistics: Basic Statistics, Main tab]


The options available here are the same as in Basic Statistics. 

N- & P-Tiles


To request N-tiles, P-tiles, and Classification (Transformation) for Rows, click the 
N- & P-Tiles tab in the Row Statistics dialog box. 


[Dialog box: Analyze: Row Statistics: Basic Statistics, N- & P-Tiles tab]


The options available here are the same as in Basic Statistics.

Saving Basic Statistics for Rows to a File


If you are saving statistics for rows to a file, you must select the format in which the
statistics are to be saved:


■ Row(s). SYSTAT saves selected statistics to a data file. Each selected statistic is a
variable in the new data file.


■ Aggregate. Saves aggregate statistics as a case in the new data file. More details
can be found in Saving Basic Statistics to a File.


Resampling 


Click the Resampling tab to specify different resampling options. 


[Dialog box: Analyze: Row Statistics: Basic Statistics, Resampling tab]


Perform resampling. Generates samples of cases and uses data thereof to carry out the
same analysis on each sample.

Method. Three sampling methods are available:

■ Bootstrap. Generates bootstrap samples. This is the default method.

■ Without replacement. Generates subsamples without replacement.

■ Jackknife. Generates jackknife samples.

Number of samples. Specify the number of samples to be generated. These samples are 
analyzed using the chosen method of sampling. The default is 1. 


Sample size. Specify the size of each sample to be generated while resampling. The 
default sample size is the number of cases in the data file in use. 


Random seed. Specify a random seed to be used while resampling. The default random 
seed is generated by the system. 


Confidence. Specify a confidence level for bootstrap-based confidence interval. Enter 
any value between 0 and 1. The default value is 0.95. 


Estimates. Specify the parameters for which you desire resampling estimates.


Row Stem-and-Leaf Plot Dialog Box


To obtain a stem-and-leaf plot for rows, from the menus choose:


Analyze 
Row Statistics 
Stem-and-Leaf... 


[Dialog box: Analyze: Row Statistics: Stem-and-Leaf]



The options available here are the same as in Stem-and-Leaf. 

To open the Cronbach dialog box, from the menus choose:


Analyze 
Correlations 
Cronbach's Alpha... 


[Dialog box: Analyze: Correlations: Cronbach's Alpha]


Cronbach computes Cronbach's alpha. This statistic is a lower bound for test reliability
and ranges in value from 0 to 1 (negative values can occur when items are negatively
correlated). Alpha can be viewed as the correlation between the items (variables)
selected and all other possible tests or scales (with the same number of items)
constructed to measure the characteristic of interest. The formula used to calculate
alpha is:

\[
\alpha = \frac{k \cdot \mathrm{avg(cov)} / \mathrm{avg(var)}}{1 + (k - 1) \cdot \mathrm{avg(cov)} / \mathrm{avg(var)}}
\]

where k is the number of items, avg(cov) is the average covariance among the items,
and avg(var) is the average variance. Note that alpha depends on both the number of 
items and the correlations among them. Even when the average correlation is small, the 
reliability coefficient can be large if the number of items is large. 
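
A minimal Python sketch of this formula follows. It uses sample covariances with an n - 1 divisor, which is an assumption (the divisor cancels in the ratio anyway); the function names are hypothetical.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance of two equal-length score lists (n - 1 divisor assumed)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def alpha_from_ratio(k, ratio):
    """Alpha for k items, given ratio = avg(cov) / avg(var)."""
    return k * ratio / (1 + (k - 1) * ratio)

def cronbach_alpha(items):
    """items: one list of scores per item (variable); at least two items."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two variables must be selected")
    avg_var = mean([covariance(x, x) for x in items])
    avg_cov = mean([covariance(items[i], items[j])
                    for i in range(k) for j in range(i + 1, k)])
    return alpha_from_ratio(k, avg_cov / avg_var)
```

For instance, alpha_from_ratio(50, 0.1) is about 0.85, which illustrates the point that many weakly related items can still yield a large reliability coefficient.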


The following must be specified to obtain Cronbach's alpha:


■ Selected variable(s). To obtain Cronbach's alpha, at least two variables must be
selected.


Using Commands 


Note: STATS module is now a Global module, and hence the command STATS is 
obsolete. You can use CSTATISTICS and RSTATISTICS commands within any other 
module. 


To generate descriptive statistics, choose your data by typing USE filename, and 
continue with: 


CLSTEM (or RWSTEM) argument / LINES=n SAMPLE=BOOT(m,n)
                                             JACK
                                             SIMPLE(m,n)
CRONBACH varlist / SAMPLE=BOOT(m,n) JACK SIMPLE(m,n)
SSAVE filename / AG
CSTATISTICS (or RSTATISTICS) argument / options SAMPLE=BOOT(m,n)
                                                       JACK
                                                       SIMPLE(m,n)

where the argument can be:


no argument
varlist or / ROWS=rowlist (rowlist or / COLUMNS=varlist)
varlist / ROWS=rowlist (rowlist / COLUMNS=varlist)


and where the options can be one or more from the following: 


ALL          N            MIN          MAX          SUM
MEAN         SEM          CIM          CONFI=n      TRMEAN=p
TREGION=TWO  GMEAN        RANGE        KURTOSIS     VARIANCE
HMEAN        MEDIAN       SES          SEK          SWTEST
SD           CV           ADTEST       SKEWNESS     MSKEWNESS
MKURTOSIS    HZTEST       NTILE=n      PTILE=n1 n2 n3 ...

For NTILE and PTILE, you can select one or more methods of computation:


METHOD = ALL CLEVELAND WTDAVG1 CLOSEST
         EMPCDF WTDAVG2 EMPCDFAVG WTDAVG3


In addition, NTILE offers classification of selected variable(s) and/or specified value(s) 
based on N-tile computations. You can specify: 


CLASSIFY = varlist or rowlist 


(Specify varlist or rowlist depending on whether CSTATISTICS or RSTATISTICS is 
requested.) 


DATA = y11 y12 y13 ...; y21 y22 y23 ...; y31 y32 y33 ...; ...

To get summarized resampling output, give the following command before the
CSTATISTICS (or RSTATISTICS) command:

SAMPLE BOOT(m,n) or SIMPLE(m,n) or JACK / MEAN MEDIAN SD VARIANCE SKEWNESS KURTOSIS CONFI=c


Usage Considerations 


Types of data. Basic Statistics uses only numeric data. 

Print options. The output is standard for all PLENGTH options. 

Quick Graphs. Basic Statistics does not create Quick Graphs for any of its commands 
except MNTEST. 

Saving files. SSAVE with CSTATISTICS or RSTATISTICS saves basic statistics as either 
records (cases) or as variables. 

BY groups. Basic Statistics analyzes data by groups only for columns. 

Case frequencies. Basic Statistics uses the FREQ variable, if present, to duplicate cases 
only for columns. For multivariate normality assessment tests FREQUENCY is ignored. 


Case weights. Basic Statistics uses the WEIGHT variable, if present, to weight cases,
but only for columns. CLSTEM and RWSTEM are not affected by the WEIGHT
variable. For N-tiles and P-tiles, WEIGHT is available only for the method "Empirical
CDF". Case weights are not used in the computation of TRMEAN, GMEAN, and HMEAN.
WEIGHT is ignored for multivariate normality assessment tests.


Examples 


Example 1 
Basic Statistics 


This example uses the OURWORLD data file, containing one record for each of 57
countries, and requests the default set of statistics for BABYMORT (infant mortality),
GNP_86 (GNP per capita in 1986), LITERACY (percentage of the population who can
read), and POP_1990 (population, in millions, in 1990).

The Statistics procedure knows only that these are numeric variables; it does not
know if the mean and standard deviation are appropriate descriptors for their
distributions. In other examples, we learned that the distribution of infant mortality is
right-skewed and has distinct subpopulations, GNP is missing for 12.3% of the
countries, the distribution of LITERACY is left-skewed and has distinct subgroups, and
a log transformation markedly improves the symmetry of the population values. This
example ignores those findings.


The input is: 


USE OURWORLD 
CSTATISTICS babymort gnp_86 literacy pop_1990


The output is: 

                   | BABYMORT     GNP_86  LITERACY  POP_1990
-------------------+-----------------------------------------
N of Cases         |       57         50        57        57
Minimum            |    5.000    120.000    11.600     0.263
Maximum            |  154.000  17680.000   100.000   152.505
Arithmetic Mean    |   48.140   4310.800    73.563    22.800
Standard Deviation |   47.236   4905.877    29.765    30.366


For each variable, SYSTAT prints the number of cases (N of Cases) with data present.
Notice that the sample size for GNP_86 is 50, or 7 fewer than the total number of
observations. For each variable, Minimum is the smallest value and Maximum, the
largest. Thus, the lowest infant mortality rate is 5 deaths (per 1,000 live births), and the
highest is 154 deaths. In a symmetric distribution, the mean and median are
approximately the same. The median for POP_1990 is 10.354 million people (see the
stem-and-leaf plot example). Here, the mean is 22.8 million, more than double the
median. This estimate of the mean is quite sensitive to the extreme values in the right tail.

Standard Deviation measures the spread of the values in each distribution. When the 
data follow a normal distribution, we expect roughly 95% of the values to fall within 
two standard deviations of the mean. 


Trimmed Mean 


The arithmetic mean (AM) for POP_1990 is quite sensitive to its right tail, so trimming
the extreme observations on the right may give a better idea of the average of
POP_1990. The populations of Pakistan, Bangladesh, and Brazil are far larger than
those of the other countries and strongly influence the AM. When we trim the upper
5% of the data, the observations for these countries are eliminated and the mean is
computed from the rest.


The input is: 


USE OURWORLD 
CSTATISTICS POP_1990 / TRMEAN = 0.05 TREGION = UPPER 


The output is: 

                                | POP_1990
--------------------------------+---------
Trimmed Mean (5%, Upper)        |   16.926
No. of Observations Trimmed Out |        3
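
Upper trimming can be sketched in Python as follows. The function name is hypothetical, and rounding the trim count to the nearest integer is an assumption: it agrees with the 3 of 57 observations reported above, but the source does not state the rule SYSTAT uses.

```python
def upper_trimmed_mean(values, proportion):
    """Drop the largest observations, then average the rest.

    Returns (trimmed mean, number of observations trimmed out).
    """
    xs = sorted(values)
    n_trim = round(proportion * len(xs))   # assumed rounding rule
    kept = xs[:len(xs) - n_trim]
    return sum(kept) / len(kept), n_trim
```

With 57 cases and a proportion of 0.05, the sketch trims round(2.85) = 3 observations.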


The AM for the whole data is 22.8 and the median is 10.354. Upper trimming of 5% of
the observations reduces the mean to 16.926 and the median to 9.969; that is, after
trimming, the mean moves closer to the median. Thus, trimming out the populations of
Pakistan, Bangladesh, and Brazil helps to obtain a better measure of the center.


Basic Statistics for Selected Row(s)


It is worth requesting basic statistics separately for different groups of countries. We
wish to compare European and Islamic countries. Notice that the first 20 cases belong
to European countries, the next 16 to Islamic, and the last 21 to New World countries.

The input is:


USE OURWORLD
CSTATISTICS babymort gnp_86 literacy pop_1990 / ROWS =,
row(1)..row(20)
CSTATISTICS babymort gnp_86 literacy pop_1990 / ROWS =,
row(21)..row(36)


The output is:


                   | BABYMORT     GNP_86  LITERACY  POP_1990
-------------------+-----------------------------------------
N of Cases         |       20         18        20        20
Minimum            |    5.000   2020.000    83.000     3.500
Maximum            |   15.000  17680.000   100.000    62.168
Arithmetic Mean    |    7.800   8911.667    97.550    21.958
Standard Deviation |    3.037   4692.581     3.804    21.018

                   | BABYMORT     GNP_86  LITERACY  POP_1990
-------------------+-----------------------------------------
N of Cases         |       16         12        16        16
Minimum            |   30.000    120.000    11.600     0.848
Maximum            |  154.000   2590.000    70.000   118.433
Arithmetic Mean    |  105.563    678.333    35.188    30.495
Standard Deviation |   34.595    778.983    19.520    37.134


Example 2
Geometric Mean


This example uses the GDP data file, containing GDP growth rates in India for nine
different sectors, such as agriculture and manufacturing, and OVERALL_GDP for the
years 1997-98 to 2004-05. Since GDP growth is expressed as a percentage of the
previous year's GDP, we use the geometric mean of (100 + GDP) to compute the
average GDP growth.


The input is:


USE GDP
LET OGDP = (100 + OVERALL_GDP)
LET MGDP = (100 + MANUFACTURE)
CSTATISTICS OGDP MGDP / GMEAN


The output is:


               |    OGDP     MGDP
---------------+------------------
Geometric Mean | 110.804  109.490
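
The reason for averaging (100 + growth) geometrically can be seen in a small Python sketch. The growth figures used in the note after the code are hypothetical, not values from the GDP file, and the function name is an assumption.

```python
import math

def average_growth(growth_percentages):
    """Geometric mean of the factors (100 + g), and the implied average growth in %."""
    factors = [100 + g for g in growth_percentages]
    log_mean = sum(math.log(f) for f in factors) / len(factors)
    gmean = math.exp(log_mean)
    return gmean, gmean - 100
```

For two hypothetical years with growth of 0% and 21%, the geometric mean of the factors 100 and 121 is 110, an average growth of 10% per year. Compounding 10% twice reproduces the overall factor of 1.21, whereas the arithmetic mean of the growth rates (10.5%) would overstate it.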


Example 3 
Harmonic Mean 


We consider the example of COFFEE taken from Hand et al. (1996). The prices of 100
grams (gm) of coffee of the same brand in 15 different shops and the amount of coffee
(in gm) per pence (GM_PER_PENCE) are given. Our interest is to find the average
amount of coffee (in gm) of that brand sold in the market per pence.


The input is: 


USE COFFEE 
CSTATISTICS GM_PER_PENCE / HMEAN 


The output is: 


              | GM_PER_PENCE
--------------+-------------
Harmonic Mean |        0.995


If one buys 100 gm from each of these 15 shops, the total amount of coffee bought is
15*100 gm and the total cost is: [cost of buying 100 gm in shop 1 = 100/(GM_PER_PENCE
for shop 1)] + [cost of buying 100 gm in shop 2 = 100/(GM_PER_PENCE for shop 2)]
+ ... + [cost of buying 100 gm in shop 15 = 100/(GM_PER_PENCE for shop 15)]. The
average amount of coffee of that brand per pence is then given by [total amount of
coffee] / [total cost].

This is actually the harmonic mean (not the AM or GM), which is 0.995.
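
The equivalence between [total amount of coffee] / [total cost] and the harmonic mean can be verified with a short Python sketch. The function names and any rates passed in are hypothetical; the COFFEE data themselves are not reproduced here.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

def overall_grams_per_pence(gm_per_pence, grams_each=100):
    """Buy grams_each in every shop, then divide total grams by total cost."""
    total_grams = grams_each * len(gm_per_pence)
    total_cost = sum(grams_each / rate for rate in gm_per_pence)
    return total_grams / total_cost
```

For any list of positive rates the two functions agree, because the common factor of 100 gm cancels between numerator and denominator.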


Example 4
Saving Basic Statistics: One Statistic and One Grouping Variable


For European, Islamic, and New World countries, we save the median infant mortality
rate, gross national product, literacy rate, and 1990 population using the OURWORLD
data file.


The input is: 


USE OURWORLD
BY group$
SSAVE mystats
CSTATISTICS babymort gnp_86 literacy pop_1990 / N MEDIAN
BY

The text results that appear on the screen are shown below (they can also be sent to a
text file).


The output is:


Results for GROUP$ = Europe

           | BABYMORT    GNP_86  LITERACY  POP_1990
-----------+----------------------------------------
N of Cases |       20        18        20        20
Median     |    6.000  9610.000    99.000    10.462


Results for GROUP$ = Islamic

           | BABYMORT    GNP_86  LITERACY  POP_1990
-----------+----------------------------------------
N of Cases |       16        12        16        16
Median     |  113.000   335.000    28.550    16.686


Results for GROUP$ = NewWorld

           | BABYMORT    GNP_86  LITERACY  POP_1990
-----------+----------------------------------------
N of Cases |       21        20        21        21
Median     |   32.000  1275.000    85.600     7.241


The MYSTATS data file (created in the SSAVE step) is shown below:


Case  GROUP$    STATISTICS$  BABYMORT  GNP_86  LITERACY  POP_1990

   1  Europe    N of Cases         20      18        20        20
   2  Europe    Median              6    9610        99    10.462
   3  Islamic   N of Cases         16      12        16        16
   4  Islamic   Median            113     335    28.550    16.686
   5  NewWorld  N of Cases         21      20        21        21
   6  NewWorld  Median             32    1275    85.600     7.241


Use a statement such as this to eliminate the sample size records: 
SELECT statistics$ <> "N of cases" 


Example 5 
Saving Basic Statistics: Multiple Statistics and Grouping Variables 


If you want to save two or more statistics for each unique cross-classification of the 
values of the grouping variables, SYSTAT can write the results in two ways: 


■ A separate record for each statistic. The values of a new variable named
STATISTICS$ identify the statistics.

■ One record containing all the requested statistics. SYSTAT generates variable
names to label the results.


The first layout is the default; the second is obtained using: 
SSAVE filename / AG 


As examples, we save the median, arithmetic mean, and standard error of the 
arithmetic mean for the cross-classification of type of country with government for the 
OURWORLD data. The nine cells for which we compute statistics are shown below 
(the number of countries is displayed in each cell): 


           Democracy  Military  One Party

Europe            16         0          4
Islamic            4         7          5
New World         12         6          3


Note the empty cell in the first row. We illustrate both file layouts—a separate record 
for each statistic and one record for all results. 


One record per statistic. The following commands are used to compute and save
statistics for the combinations of GROUP$ and GOV$ shown in the table above:


USE OURWORLD
BY group$ gov$
SSAVE mystats2
CSTATISTICS babymort gnp_86 literacy pop_1990 / N MEDIAN MEAN SEM
BY

The MYSTATS2 file with 32 cases and seven variables is shown below:


Case  GROUP$    GOV$       STATISTICS$                        BABYMORT    GNP_86  LITERACY  POP_1990

   1  Europe    Democracy  N of Cases                               16        16        16        16
   2  Europe    Democracy  Median                                    6     10005        99     9.969
   3  Europe    Democracy  Arithmetic Mean                       6.875      9770    97.250    22.427
   4  Europe    Democracy  Standard Error of Arithmetic Mean     0.547  1057.226     1.055     5.751
   5  Europe    OneParty   N of Cases                                4         2         4         4
   6  Europe    OneParty   Median                                   12      2045        99    15.995
   7  Europe    OneParty   Arithmetic Mean                      11.500      2045    98.750    20.084
   8  Europe    OneParty   Standard Error of Arithmetic Mean     1.708        25     0.250     6.036
   9  Islamic   Democracy  N of Cases                                4         4         4         4
  10  Islamic   Democracy  Median                                   97       370    29.550    12.612
  11  Islamic   Democracy  Arithmetic Mean                          91       700    37.300    12.761
  12  Islamic   Democracy  Standard Error of Arithmetic Mean    23.083   378.660     9.312     5.315
  13  Islamic   OneParty   N of Cases                                5         3         5         5
  14  Islamic   OneParty   Median                                  116       280        18    15.862
  15  Islamic   OneParty   Arithmetic Mean                     109.800  1016.667    29.720    15.355
  16  Islamic   OneParty   Standard Error of Arithmetic Mean    15.124   787.196     9.786     3.289
  17  Islamic   Military   N of Cases                                7         5         7         7
  18  Islamic   Military   Median                                  116       350        29    51.667
  19  Islamic   Military   Arithmetic Mean                     110.857       458    37.886    51.444
  20  Islamic   Military   Standard Error of Arithmetic Mean    11.801   180.039     7.779    18.678
  21  NewWorld  Democracy  N of Cases                               12        12        12        12
  22  NewWorld  Democracy  Median                                   35      1645    86.800    15.102
  23  NewWorld  Democracy  Arithmetic Mean                      44.667  2894.167    85.800    26.490
  24  NewWorld  Democracy  Standard Error of Arithmetic Mean     9.764  1085.810     3.143    11.926
  25  NewWorld  OneParty   N of Cases                                3         2         3         3
  26  NewWorld  OneParty   Median                                   16      2995    98.500     2.441
  27  NewWorld  OneParty   Arithmetic Mean                      14.667      2995    90.500     4.441
  28  NewWorld  OneParty   Standard Error of Arithmetic Mean     1.333      2155     8.251     3.153
  29  NewWorld  Military   N of Cases                                6         6         6         6
  30  NewWorld  Military   Median                                   55       780    60.500     5.726
  31  NewWorld  Military   Arithmetic Mean                      53.167      1045    63.000     6.886
  32  NewWorld  Military   Standard Error of Arithmetic Mean    13.245   287.573    10.820     1.515

The average infant mortality rate for European democratic nations is 6.875 (case 3),
while the median is 6.0 (case 2).


One record for all statistics. Instead of four records (cases) for each combination of
GROUP$ and GOV$, we specify AG (aggregate) to prompt SYSTAT to write one
record for each cell:

USE OURWORLD
BY group$ gov$
SSAVE mystats3 / AG
CSTATISTICS babymort gnp_86 literacy pop_1990 / N MEDIAN MEAN SEM
BY

The MYSTATS3 file, with 8 cases and 18 variables, is shown below (we separated them
into three panels and shortened the variable names):


CASE  GROUP$    GOV$       NU1BABY  MD1BABY  ME1BABY  SE1BABY
                           MORT     MORT     MORT     MORT

   1  Europe    Democracy       16        6    6.875    0.547
   2  Europe    OneParty         4       12   11.500    1.708
   3  Islamic   Democracy        4       97       91   23.083
   4  Islamic   OneParty         5      116  109.800   15.124
   5  Islamic   Military         7      116  110.857   11.801
   6  NewWorld  Democracy       12       35   44.667    9.764
   7  NewWorld  OneParty         3       16   14.667    1.333
   8  NewWorld  Military         6       55   53.167   13.245


NU2GNP_86  MD2GNP_86  ME2GNP_86  SE2GNP_86  NU3       MD3
                                            LITERACY  LITERACY

       16      10005       9770   1057.226        16        99
        2       2045       2045         25         4        99
        4        370        700    378.660         4    29.550
        3        280   1016.667    787.196         5        18
        5        350        458    180.039         7        29
       12       1645   2894.167   1085.810        12    86.800
        2       2995       2995       2155         3    98.500
        6        780       1045    287.573         6    60.500


ME3       SE3       NU4       MD4       ME4       SE4
LITERACY  LITERACY  POP_1990  POP_1990  POP_1990  POP_1990

  97.250     1.055        16     9.969    22.427     5.751
  98.750     0.250         4    15.995    20.084     6.036
  37.300     9.312         4    12.612    12.761     5.315
  29.720     9.786         5    15.862    15.355     3.289
  37.886     7.780         7    51.667    51.444    18.678
  85.800     3.143        12    15.102    26.490    11.926
  90.500     8.251         3     2.441     4.441     3.153
      63    10.820         6     5.726     6.886     1.515


Note that there are no European countries with military governments, so no record is
written.

Example 6
Stem-and-Leaf Plot


We request robust statistics for BABYMORT (infant mortality), POP_1990 (1990
population in millions), and LITERACY (percentage of the population who can read)
from the OURWORLD data file.


The input is: 


USE OURWORLD 
CLSTEM babymort pop_1990 literacy 


The output is: 
Stem and Leaf Plot of Variable: BABYMORT, N = 57


Minimum: 5.000
Lower Hinge: 7.000
Median: 22.000
Upper Hinge: 74.000
Maximum: 154.000


  0  H  5666666666677777
  1     00123456668
  2  M  227
  3     028
  4     9
  5
  6     11224779
  7  H  4
  8     77
  9
 10     77
 11     066
 12     559
 13     6
 14     07
 15     4


Stem and Leaf Plot of Variable: POP_1990, N = 57


Minimum: 0.263
Lower Hinge: 6.142
Median: 10.354
Upper Hinge: 25.567
Maximum: 152.505


  0     00122333444
  0  H  5556667777788899
  1  M  0000034
  1     556789
  [rows for stems 2 through 5 are illegible in the source]
      * * * Outside Values * * *
  5     6677
  6     2
 11     48
 15     2


Stem and Leaf Plot of Variable: LITERACY, N = 57


Minimum: 11.600
Lower Hinge: 55.000
Median: 88.000
Upper Hinge: 99.000
Maximum: 100.000


  1     1258
  2     035689
  3     1
  4
  5  H  002556
  6     355
  7     0446
  8  M  03558
  9  H  03344457888889999999999999
 10     00


In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf.
The stems are listed as a column on the left, and the leaves for each stem are in a row
on the right. For infant mortality (BABYMORT), the maximum number of babies who
die in their first year of life is 154 (out of 1,000 live births). Look for this value at the
bottom of the BABYMORT display. The stem for 154 is 15, and the leaf is 4. The
minimum value for this variable is 5; its leaf is 5 with a stem of 0.

The median value of 22 is printed here as the Median in the top panel and marked
by an M in the plot. The hinges, marked by H's in the plot, are 7 and 74 deaths, meaning
that 25% of the countries in our sample have a death rate of 7 or less, and another 25%
have a rate of 74 or higher. Furthermore, the gaps between 49 and 61 deaths and
between 87 and 107 indicate that the sample does not appear homogeneous.

Focusing on the second plot, the median population size is 10.354, or more than 10
million people. One-quarter of the countries have a population of 6.142 million or less.
The largest country (Brazil) has more than 152 million people. The largest stem for
POP_1990 is 15, like that for BABYMORT. This 15 comes from 152.505, so the 2 is
the leaf and the 0.505 is lost.

The plot for POP_1990 is very right-skewed. Notice that a real number line extends
from the minimum stem of 0 (0.263) to the stem of 5 for 51 million. The values below
Outside Values (stems of 5, 6, 11, and 15 with 8 leaves) do not fall along a number line,
so the right tail of this distribution extends farther than one would think at first glance.

The median in the final plot indicates that half of the countries in our sample have
a literacy rate of 88% or better. The upper hinge is 99%, so more than one-quarter of
the countries have a rate of 99% or better. In the country with the lowest rate (Somalia),
only 11.6% of the people can read. The stem for 11.6 is 1 (the 10's digit), and the leaf
is 1 (the units' digit). The 0.6 is not part of the display. For stem 10, there are two leaves
that are 0, so two countries have 100% literacy rates (Finland and Norway). Notice the
11 countries (at the top of the plot) with very low rates. Is there a separate subgroup
here?
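
The splitting of a value into stem and leaf described above can be sketched in Python. The function name and the leaf_unit parameter are hypothetical, and the sketch handles non-negative values only; it is not SYSTAT's routine for choosing stems.

```python
import math

def stem_and_leaf(value, leaf_unit=1):
    """Split a non-negative value into (stem, leaf).

    Digits below the leaf unit are dropped, as with the 0.505 of 152.505.
    """
    truncated = math.floor(value / leaf_unit)
    return divmod(truncated, 10)
```

With a leaf unit of 1, the values discussed in the text split as follows: 154 gives stem 15 and leaf 4, 152.505 gives stem 15 and leaf 2, and 11.6 gives stem 1 and leaf 1.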


Transformations 


Because the distribution of POP_1990 is very skewed, it may not be suited for analyses
based on normality. To find out, we transform the population values to log base 10 units
using the L10 function.


The input is:


USE OURWORLD
LET logpop90 = L10(pop_1990)
CLSTEM logpop90

The output is: 
Stem and Leaf Plot of Variable: LOGPOP90, N = 57


Minimum: -0.581
Lower Hinge: 0.788
Median: 1.015
Upper Hinge: 1.408
Maximum: 2.183


[Stem-and-leaf display for LOGPOP90; only the outside value at stem -0 (leaf 5) and the
median row (1 M 00000111) are legible in the source.]


For the untransformed values of the population, the stem-and-leaf plot identifies eight
outliers. Here, there is only one outlier. More important, however, is the fact that the
shape of the distribution for these transformed values is much more symmetric.


Subpopulations


Here, we stratify the values of LITERACY for countries grouped as European, Islamic, 
and New World. 


The input is: 


USE OURWORLD
BY group$
CLSTEM babymort pop_1990 literacy
BY


The output is: 


Results for GROUP$ = Europe
Stem and Leaf Plot of Variable: LITERACY, N = 20


Minimum: 83.000
Lower Hinge: 98.000
Median: 99.000
Upper Hinge: 99.000
Maximum: 100.000


  83     0
  [one row illegible in the source]
  95     0
      * * * Outside Values * * *
  97     0
  97
  98  H  000
  98
  99  M  00000000000
  99
 100     00


Results for GROUP$ = Islamic 
Stem and Leaf Plot of Variable: LITERACY, N = 16 


Minimum: 11.600 
Lower Hinge:19.000 
Median:28.550 
Upper Hinge:53.500 
Maximum: 70.000 


Results for GROUP$ = NewWorld
Stem and Leaf Plot of Variable: LITERACY, N = 21


Minimum: 23.000
Lower Hinge: 74.000
Median: 85.600
Upper Hinge: 94.000
Maximum: 99.000


[Stem-and-leaf display for the New World countries; only the Outside Values marker and
the hinge rows (H 44 and H 03444) are legible in the source.]


The output is listed only for the variable LITERACY. The literacy rates for Europe and
the Islamic nations do not even overlap. The rates range from 83% to 100% for the 
Europeans and 11.6% to 70% for the Islamic. Earlier, 11 countries were identified that 
have rates of 31% or less. From these stratified results, we learn that 10 of the countries 
are Islamic and 1 (Haiti) is from the New World. The Haitian rate (23%) is identified 
as an outlier with respect to the values of the other New World countries. 


Stem-and-Leaf Plot for Selected Row(s) 


It is worth requesting stem-and-leaf for literacy in cities. Notice that 40 records are 
from cities. 


The input is: 


USE OURWORLD 
SORT urban$ 


CLSTEM literacy / ROWS = row(1)..row(40)

The output is: 
Stem and Leaf Plot of variable: LITERACY, N = 40 

Minimum: 50.000 

Lower hinge: 85.300 

Median: 97.500 

Upper hinge: 99.000 

Maximum: 100.000 

The distribution is left-skewed; even a log transform does not produce a symmetric
shape.


Example 7
N-tiles and P-tiles


We request N-tiles and P-tiles for the variables SCORES1 and INCOME of the data file
INCOME.


The input is:


USE INCOME
CSTATISTICS SCORES1 INCOME / NTILE = 10 PTILE = 2 3 4 5 6 10 24,
56 89 90 92 95 96 97 98 METHOD = cleveland


The output is: 


9 NTILES requested

                   |  SCORES1    INCOME
-------------------+--------------------
Method = CLEVELAND |
  2.000%           |    0.000     0.000
  3.000%           |    0.000     0.000
  4.000%           |    0.140     0.000
  5.000%           |    0.300     0.350
  6.000%           |    0.460     0.770
 10.000%           |    1.200     2.450
 24.000%           |    4.360     5.420
 56.000%           |   45.380    12.748
 89.000%           |   75.140    60.174
 90.000%           |   76.900    69.690
 92.000%           |   82.400    88.722
 95.000%           |   92.000   117.270
 96.000%           |   95.200   125.200
 97.000%           |   98.000   125.200
 98.000%           |   98.000   125.200
 1 of 10           |    1.200     2.450
 2 of 10           |    3.000     3.860
 3 of 10           |    8.500     8.040
 4 of 10           |   31.800     9.450
 5 of 10           |   43.500    10.900
 6 of 10           |   47.400    14.380
 7 of 10           |   54.500    30.430
 8 of 10           |   60.700    36.000
 9 of 10           |   76.900    69.690


Transformation of Variables and Specified Values 


Suppose we have requested N-tiles for a variable. SYSTAT produces the N-1 numbers
separating the N intervals. Let us index the intervals from 1 to N, starting from the left.
Now suppose we want to classify the observations themselves, new values, or both into
groups 1 to N according to the interval in which each falls. For instance, in the data file
INCOME, where the first column holds test scores, suppose the students are divided
into four N-tile groups on the basis of SCORES1 using the Empirical CDF (average)
method, and the same scores, together with the new scores 48 and 65, are then to be
classified into these N-tile groups.


The input is: 


USE INCOME 
CSTATISTICS SCORES1 / NTILE = 4 METHOD = EMPCDFAVG DATA = 48 65,
            CLASSIFY = SCORES1


The output is: 
3 NTILES requested

                   |  SCORES1
-------------------+---------
Method = EMPCDFAVG |
1 of 4             |    7.000
2 of 4             |   44.000
3 of 4             |   58.000


Classification of Data Matrix: Method = EMPCDFAVG 


SCORES1 GROUP 


Classification of variables: Method = EMPCDFAVG 


SCORES1 GROUP 
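
The classification step can be sketched outside SYSTAT. The following Python fragment is only an illustration, not SYSTAT's implementation: it takes the three cut points reported above and assigns a value to the interval in which it falls. The function name `ntile_group` is hypothetical, and the treatment of a value that equals a cut point (placed in the lower group here) is an assumption.

```python
from bisect import bisect_left

# Cut points copied from the EMPCDFAVG output above (3 cut points -> 4 groups).
CUT_POINTS = [7.0, 44.0, 58.0]


def ntile_group(value, cut_points=CUT_POINTS):
    """Return the 1-based index of the interval that contains `value`.

    Assumption: a value equal to a cut point is placed in the lower group.
    """
    return bisect_left(cut_points, value) + 1


# Classify the two new scores named in the DATA option.
for score in (48, 65):
    print(score, "-> group", ntile_group(score))
```

Under these assumptions, 48 falls between the second and third cut points (group 3) and 65 lies above the third cut point (group 4).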
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Example 8 
Basic Statistics for Rows 


This example uses the SCORES data file, containing scores of 10 students in 14
examinations. We wish to look at the overall performance of the students. To see this
we will compute the sum and mean for rows.


The input is: 


USE SCORES 
RSTATISTICS /SUM MEAN 


The output is: 

Sum Arithmetic Mean 
ROW (1) 907.000 64.786 
ROW (2) 1036.000 74.000 
ROW (3) 762.000 54.429 
ROW (4) 858.000 61.286 
ROW (5) 920.000 65.714 
ROW (6) 826.000 59.000 
ROW (7) 1042.000 74.429 
ROW (8) 1021.000 72.929 
ROW (9) 834.000 59.571 
ROW (10) 825.000 58.929 


Row Statistics for Selected Variable(s) 
Now we request row statistics for the first 7 tests, and last 7 tests, separately. 


The input is: 


USE SCORES 
RSTATISTICS row(1)..row(10)/COLUMNS = test1..test7 SUM MEAN
RSTATISTICS row(1)..row(10)/COLUMNS = test8..test14 SUM MEAN

The output is:

Sum Arithmetic Mean 
ROW (1) 407.000 58.143 
ROW (2) 471.000 67.286 
ROW (3) 328.000 46.857 
ROW (4) 359.000 51.286 
ROW (5) 402.000 57.429 
ROW (6) 352.000 50.286 
ROW (7) 478.000 68.286 
ROW (8) 476.000 68.000 
ROW (9) 396.000 56.571 
ROW (10) 374.000 53.429 


Sum Arithmetic Mean 


ROW (1) 500.000 71.429 
ROW (2) 565.000 80.714 
ROW (3) 434.000 62.000 
ROW (4) 499.000 71.286 
ROW (5) 518.000 74.000 
ROW (6) 474.000 67.714 
ROW (7) 564.000 80.571 
ROW (8) 545.000 77.857 
ROW (9) 438.000 62.571 
ROW (10) 451.000 64.429 
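
As a quick consistency check on these listings, the sums for the first seven and the last seven tests should add up to the sums over all 14 tests. The short Python sketch below uses only the sums printed above; the variable names are illustrative.

```python
# Row sums copied from the three RSTATISTICS outputs above.
all_tests = [907, 1036, 762, 858, 920, 826, 1042, 1021, 834, 825]
first_seven = [407, 471, 328, 359, 402, 352, 478, 476, 396, 374]
last_seven = [500, 565, 434, 499, 518, 474, 564, 545, 438, 451]

for row, (total, a, b) in enumerate(zip(all_tests, first_seven, last_seven), start=1):
    # The two partial sums must reproduce the overall sum for the same row.
    assert a + b == total, f"ROW({row}) does not add up"
    print(f"ROW({row}): {a} + {b} = {total}, mean = {total / 14:.3f}")

# Rows with the highest and lowest overall sums (1-based row numbers).
best = max(range(10), key=lambda i: all_tests[i]) + 1
worst = min(range(10), key=lambda i: all_tests[i]) + 1
print("best row:", best, "worst row:", worst)
```

The check passes for every row, and it identifies row 7 as the highest total and row 3 as the lowest.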
Example 9 


Normality Assessment Using Shapiro-Wilk and Anderson-Darling Test 


We can assess univariate normality using the Shapiro-Wilk and Anderson-Darling tests.
Here we use the FOREARM data to assess the normality of the HEIGHT variable.


The input is: 


USE FOREARM 
CSTATISTICS HEIGHT / SWTEST ADTEST 


The output is: 
                                    | HEIGHT
------------------------------------+-------
Shapiro-Wilk Statistic              |  0.968
Shapiro-Wilk p-value                |  0.433
Anderson-Darling Statistic          |  0.502
Adjusted Anderson-Darling Statistic |
p-value                             |


The p-values for the Shapiro-Wilk and Anderson-Darling tests are large, so the data
give no evidence against the normality of HEIGHT.




Example 10
Stem-and-Leaf Plot for Rows 


We request stem-and-leaf plots for the best and the worst student according to total
score (rows 7 and 3 in the row sums of Example 8).
The input is: 


USE SCORES 
RWSTEM row(7) row(3) 


The output is: 
Stem and Leaf Plot of Variable: ROW(7), N = 14 
Minimum : 32.000 
Lower Hinge : 70.000 
Median : 76.500 
Upper Hinge : 85.000 
Maximum : 89.000 


   3   2
   * * * Outside Values * * *
   [remaining stems and leaves are illegible in the source]

Stem and Leaf Plot of Variable: ROW(3), N = 14 


Minimum : 9.000 
Lower Hinge : 42.000 
Median : 62.000 
Upper Hinge : 65.000 
Maximum : 71.000 


M 02223459 
11 


Row Stem-and-Leaf for Selected Variable(s) 


We request a stem-and-leaf plot for the same students for the first and last 7 tests 
separately. 


The input is: 


USE SCORES 
RWSTEM row(7) row(3)/COLUMNS = test1..test7
RWSTEM row(7) row(3)/COLUMNS = test8..test14

The output is:
Stem and Leaf Plot of Variable: ROW(7), N= 7 
Minimum : 32.000 
Lower Hinge : 63.500 
Median : 70.000 
Upper Hinge : 80.000 
Maximum : 89.000 
   3   2
   * * * Outside Values * * *
   [remaining stems and leaves are illegible in the source]


Stem and Leaf Plot of Variable: ROW(3), N= 7 


Minimum : 9.000 
Lower Hinge : 31.000 
Median : 60.000 
Upper Hinge : 64.000 
Maximum : 69.000 
On 29 
1 
239 5 
321 
4 
5 
6 M 0359 


Stem and Leaf Plot of Variable: ROW(7), N = 7


Minimum : 73.000
Lower Hinge : 76.500
Median : 84.000
Upper Hinge : 84.500
Maximum : 85.000 
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Stem and Leaf Plot of Variable: ROW(3), N = 7 


Minimum : 42.000 
Lower Hinge : 62.000 
Median : 62.000 
Upper Hinge : 67.500 
Maximum : 71.000 
4 2 
* * * Outside Values * * * 71.000
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Algorithms 


SYSTAT uses a one-pass provisional algorithm (Spicer, 1972). Wilkinson and Dallal
(1977) summarize the performance of this algorithm versus those used in several
statistical packages. The finite-sample adjustment for the Anderson-Darling test statistic
is due to Stephens (1982); for the p-value computation, refer to Nelson (1998).
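
To give a feel for what a one-pass provisional algorithm does, the sketch below updates a running mean and a running sum of squared deviations as each value arrives, so the data are read only once. It is a generic Welford-style update written for illustration; it is not taken from SYSTAT, and the function name `provisional_moments` is hypothetical.

```python
def provisional_moments(values):
    """One-pass running mean and sample variance (Welford-style update)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # provisional mean after n values
        m2 += delta * (x - mean)   # uses both the old and the new mean
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


n, mean, variance = provisional_moments([2, 4, 4, 4, 5, 5, 7, 9])
print(n, mean, variance)
```

Updating deviations from a provisional mean avoids the loss of precision that can occur when a sum of squares and a squared sum are accumulated separately and then subtracted.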
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Design of Experiments 


Herb Stenson 


Design of Experiments (DOE) generates design matrices for a variety of ANOVA and 
mixture models. You can use Design of Experiments as both an online library and a 
search engine for experimental designs, saving any design to a SYSTAT file. You can 
run the associated experiment, add the values of a dependent variable to the same file, 
and analyze the experimental data by using General Linear Model (or another 
SYSTAT statistical procedure). 

SYSTAT offers three methods for generating experimental designs: Classic DOE, 
the DOE Wizard, and the DESIGN command. 


■ Classic DOE provides a standard dialog interface for generating the most popular
complete (full) and incomplete (fractional) factorial designs. Complete factorial
designs can have two or three levels of each factor, and the number of factors is
limited to seven. Incomplete designs include: Latin square designs with
2 to 12 levels per factor; selected two-level designs described by Box, Hunter, and
Hunter (1978) with 2 to 11 factors and from 4 to 128 runs; 13 of the most popular
Taguchi (1987) designs; all of the Plackett and Burman (1946) two-level designs
with 4 to 100 runs; the 6 three-, five-, and seven-level designs described by Plackett
and Burman; and the set of 10 three-level designs described by Box and Behnken
(1960) in both their blocked and unblocked versions. In addition, the Lattice,
Centroid, Axial, and Screening mixture designs can be generated. The number of
factors (components of a mixture) can be as large as your computer's memory
allows.
■ The DOE Wizard provides an alternative interface consisting of a series of
questions defining the structure of the design. The Wizard offers more designs
than Classic DOE, including response surface and optimal designs. Optimization
methods include the Fedorov, k-exchange, and coordinate exchange algorithms
with three optimality criteria available. The coordinate exchange algorithms
accommodate both continuous and categorical variables. The search algorithms for
fractional factorial designs allow any number of levels for any factor and search for
orthogonal, incomplete blocks if requested. The number of factors for factorial,
central composite, and optimal designs is restricted only by your computer's
memory.


■ The DESIGN command generates all designs found in Classic DOE using
SYSTAT's command language.


Designs can be replicated as many times as you want, and the runs can be randomized. 


Statistical Background 


The Research Problem 


As an investigator interested in solving problems, you are faced with the task of 
identifying good solutions. You do this by using what you already know about the 
problem area to make a judgment about the solution(s). If you possess in-depth process 
knowledge, then there is little work to be done; you simply apply that knowledge to the 
problem at hand and derive a solution. 

More common is the situation in which you have limited knowledge about the 
factors involved and their interrelationships, so that any conjecture would be quite 
uncertain and far from optimal. In these situations, the first step would be to enhance 
your knowledge. This is usually done by empirical investigation—that is, by 
systematically observing the factors and how they affect the outcome of interest. The 
results of these observations become the data in your study. 

Process problems usually have factors, or variables, that may affect the outcome, 
and responses that measure the outcome of interest. The basic problem-solving 
approach is to develop a model that helps you understand the specific relationships 
between factors and responses. Such a model allows you to predict which factor values 
will lead to a desired response, or outcome. These empirical data provide the statistical 
basis used to generate models of your process. 
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Types of Investigation 


You can think of any empirical investigation as falling into one of two broad classes: 
experiment or observational study. The two classes have different properties and are 
used to approach different types of problems. 


Experiments 


Experiments are studies in which the factors are under the direct control of the 
experimenter. That is, the experimenter assigns certain values of the factors to each 
run, or observation. The response(s) are recorded for each chosen combination of 
factor levels. 

Because the factors are being manipulated by the experimenter, the experimenter 
can make inferences about causality. If assigning a certain temperature leads to a 
decrease in the output of a chemical process, you can be fairly certain that temperature
really did cause the decrease because you assigned the temperature value while holding 
other factors constant. 

Unfortunately, experiments do have a drawback in that there are some situations in 
which it is either impossible or impractical, or even unethical, to exercise control over 
the factors of interest. In those situations, an observational study must be used. 


Observational Studies 


Observational studies use only minimal, if any, intervention by the observer on the 
process. The observer merely observes and records changes in the response as the 
factors undergo their natural variation. No attempt is made to control the factors. 
Because the factors are not under the control of the experimenter, observational
studies are very limited in their ability to explain causal relationships. For example,
suppose you observe that shoe size and scholastic achievement show a strong
relationship among school children. Can you infer that larger feet cause achievement?
Of course not. The truth of the matter is that both variables are most likely caused by
a third (unmeasured) variable: age. Older students have larger feet, and they have
been in school longer. If you could have some control over shoe size, you could make
sure that shoe sizes were evenly distributed across students of different ages, and you
would be in a much better position to make inferences about the causal relationship
between shoe size and achievement.
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But of course it is silly to speak of controlling shoe size, since you cannot change 
the size of people’s feet. This illustrates the strength of observational studies—they can 
be employed where true experimental studies are impossible, for either ethical or 
practical reasons. 
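
The shoe-size example can be imitated with simulated data. In the Python sketch below, age drives both shoe size and achievement, and neither variable affects the other. All numbers (age range, slopes, noise levels, sample size, random seed) are invented for illustration and are not taken from any data set.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable
ages = [rng.randint(6, 12) for _ in range(2000)]
shoe = [0.5 * a + rng.gauss(0, 0.5) for a in ages]   # depends on age only
score = [10 * a + rng.gauss(0, 10) for a in ages]    # depends on age only

# Correlation over all children, and among nine-year-olds only.
overall = pearson(shoe, score)
nine = [(s, t) for a, s, t in zip(ages, shoe, score) if a == 9]
within = pearson([s for s, _ in nine], [t for _, t in nine])
print(f"all ages: r = {overall:.2f}; age 9 only: r = {within:.2f}")
```

Across all ages the two variables are strongly correlated, but among children of the same age the correlation is close to zero, which is the pattern expected when a third variable produces the relationship.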

Because the focus of this chapter is the design and analysis of experimental studies, 
further references to observational studies will be minimal. 


The Importance of Having a Strategy 


Controlling the factors in an experiment is only the beginning of effective experimental 
research. Once you determine that you have a problem that can be addressed by 
experimentation, you need to answer other crucial questions: what will your 
experiment look like? What levels of which factors will you measure? How will you 
analyze the results to convert your data to knowledge? These are the questions that 
SYSTAT can help you answer. 

Careful planning of your experiment will give you many advantages over a poorly 
designed, haphazard approach to data collection. As Box, Hunter and Hunter (1978) 
point out: 

Frequently conclusions are easily drawn from a well-designed experiment, even 
when rather elementary methods of analysis are employed. Conversely, even the most 
sophisticated statistical analysis cannot salvage a badly designed experiment (p. vii.). 


Completeness 


By using a well-designed experiment, you will be able to discover the most important
relationships in your process. Lack of planning can lead to incomplete designs that
leave certain questions unanswered, confounding that causes confusion of two or more
effects so that they become statistically indistinguishable, and poor precision of
estimates.


Efficiency 


Carefully planned experiments allow you to get the information you need at a fraction 
of the cost of a poorly planned design. Content knowledge can be applied to select 
specific effects of interest, and your experimental runs can be targeted to answer just 
those effects. Runs are not wasted on testing effects you already understand well. 
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Insight 


A well-designed experiment allows you to see patterns in the data that would be 
difficult to spot in a simple table of hastily collected values. The mathematical model 
you build based on your observations will be more reliable, more accurate, and more 
informative if you use well-chosen run points from an appropriate experimental 
design. 


The Role of Experimental Design in Research 


Experimental design is the interface between your question and the “real world”. The
design tells you how much data you will need to collect, what factor levels to use for
the run points, and how to analyze the results to get a useful model of the process. The
model you derive from your experiment can then be applied to the problem at hand,
enhancing your knowledge and allowing you to confidently formulate a solution.
The figure below illustrates the flow of knowledge in experimental research. Notice
that the diagram is circular: you start with some knowledge, formulate a research
question, and perform the research; then the knowledge you gained from the research
is used to formulate new research questions, new designs, and so on. As you go through
the iterations, you should find that your information increases in both quantity and
quality.


[Figure: the flow of knowledge in experimental research]
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Types of Experimental Designs 


There is a wide variety of experimental designs, each of which addresses a different
type of research problem. These designs tend to fall into broad classes, which can be
summarized as follows:

■ Factorial designs. These designs are used to identify important effects in your
process.

■ Response surface designs. These designs are useful when you want to find the
combination of factor values that gives the highest (or lowest) response.

■ Mixture designs. These designs are useful when you want to find the ideal
proportions of ingredients for a mixture process. Mixture designs take into account
the fact that all the component proportions must sum to 1.0.

■ Optimal designs. These designs are useful when you have enough information
available to give a very detailed specification of the model you want to test.
Because optimal designs are very flexible, you can use them in situations where no
standard design is available. Optimal designs are also useful when you want to
have explicit control over the type of efficiency maximized by the design.


Factorial Designs 


In investigating the factors that affect a certain process, the basic building blocks of 
your investigation are observations of the system under different conditions. You vary 
the factors under your control and measure what happens to the outcome of the process. 
The naive inquirer might use a haphazard, trial-and-error approach to testing the 
factors. Of course, this approach can take a long time and many observations, or runs, 
to give reasonable results (if it does at all), and, in fact, it may fail to reveal important 
effects because of the lack of an investigative strategy.

Someone more familiar with scientific methodology might make systematic 
comparisons of various levels of each factor, holding the others constant. However, 
while this approach is more reliable than the trial-and-error approach, it can still cause 
you to overlook important effects. Consider the following hypothetical response plot. 
The contours indicate points of equal response. 




[Figure: contour plot of the hypothetical response; horizontal axis x1]


If you tried the one-at-a-time approach, your ability to accurately measure the effects
of the variables would depend on the initial settings you chose. For example, suppose
you choose the point indicated by the horizontal line as your fixed starting value for x2;
as you varied x1, you would conclude that the maximum response occurs when
x1 = 47. Then, you would fix x1 at 47 and vary x2, concluding that the maximum
response occurs when x2 = 98. The two following figures illustrate this problem.
However, it is clear from the previous contours that the maximum effect occurs where
x1 = 100 and x2 = 220, or perhaps even somewhere outside the range that you have
measured.


[Figures: response plotted against X1 (X2 held constant at 98.0) and against X2 (X1 held constant at 47.0)]


This illustrates the importance of considering the factors simultaneously. The only way
to find the true effects of the factors on the response variable is to take measurements
at carefully planned combinations of the factor levels, as shown below. Such designs
are called factorial designs. A factorial design that could be used to explore the
hypothetical process would take measurements at high, medium, and low levels of
each factor, with all combinations of levels used in the design.


[Figure: run points of the three-level factorial design on the contour plot; horizontal axis x1]
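
The contrast between the one-at-a-time approach and a factorial design can be tried numerically. The Python sketch below uses an invented response function with a rising diagonal ridge; it is not the surface plotted in this chapter, and its coefficients were chosen only so that the one-at-a-time search stalls on the ridge.

```python
from itertools import product


def response(x1, x2):
    """Hypothetical response with a rising diagonal ridge (illustration only)."""
    return x1 - 0.05 * (x2 - 2.2 * x1) ** 2


# One-factor-at-a-time: hold x2 at 98 and vary x1, then hold x1 and vary x2.
x2_start = 98
best_x1 = max(range(0, 101), key=lambda x1: response(x1, x2_start))
best_x2 = max(range(0, 251), key=lambda x2: response(best_x1, x2))
ofat_point = (best_x1, best_x2)

# Three-level full factorial: low, medium, and high settings of each factor.
grid = list(product((0, 50, 100), (0, 125, 250)))
factorial_point = max(grid, key=lambda p: response(*p))

print("one-at-a-time stops at", ofat_point, "response", response(*ofat_point))
print("best factorial run is ", factorial_point, "response", response(*factorial_point))
```

For this invented surface the one-at-a-time search settles near the middle of the ridge, while the nine factorial runs already include a point with a higher response in the high-x1, high-x2 corner, which points toward the region where the response is largest.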


Factorial designs can be classified into two broad types: full (or complete) factorials 
and fractional factorials, shown below. Full factorials (a) use observations at all 
combinations of all factor levels. Full factorials give a lot of insight into the effects of 
the factors, particularly interactions, or joint effects of variables. Unfortunately, they 
often require a large number of runs, which means that they can be expensive. 
Fractional factorials (b) use only some combinations of factor levels. This means that 
they are efficient, requiring fewer runs than their full factorial counterpart. However, 
to gain this efficiency, they sacrifice some (or all) of their ability to measure interaction 
effects. This makes them ill-suited to exploring the details of complex processes. 


[Figure: (a) full factorial design; (b) fractional factorial design]
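
The difference between the two types can be shown with a small enumeration. The Python sketch below lists the eight runs of a full two-level design in three factors and then keeps the half fraction defined by I = ABC. The coded levels -1 and +1 and the factor labels A, B, and C are illustrative conventions, not SYSTAT output.

```python
from itertools import product

# Full 2^3 factorial in coded units (-1 = low level, +1 = high level).
full = list(product((-1, 1), repeat=3))

# Half fraction with the defining relation I = ABC: keep runs where A*B*C = +1.
half = [run for run in full if run[0] * run[1] * run[2] == 1]

print(len(full), "runs in the full factorial")
print(len(half), "runs in the half fraction:")
for a, b, c in half:
    # In this fraction the A column equals the B*C column, so the main effect
    # of A cannot be separated from the BC interaction.
    print(a, b, c, "  A == B*C:", a == b * c)
```

The half fraction needs only four runs, but each main effect is aliased with a two-factor interaction, which is the loss of interaction information described above.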
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Fractional Factorial Design Types 


The following types of fractional factorial designs can be generated: 


■ Homogeneous fractional. These are fractional designs in which all factors have the
same number of levels.

■ Mixed-level fractional. These are fractional designs in which factors have different
numbers of levels.

■ Box-Hunter. This is a set of fractional designs for two-level factors that can be
specified based on the number of factors and the number of runs (as a power of 2).

■ Plackett-Burman. These designs are saturated (or nearly saturated) fractional
factorial designs based on orthogonal arrays. They are very efficient for estimating
main effects but rely on the absence of two-factor interactions.

■ Taguchi. These designs are orthogonal arrays allowing for a maximum number of
main effects to be estimated from a minimum number of runs in the experiment
while allowing for differences in the number of factor levels.

■ Latin square. These designs are useful when there are restrictions on
randomization, where you need to isolate the effects of one or more blocking
(or "nuisance") factors. In Latin square designs, all factors must have the same
number of levels. Graeco-Latin squares and hyper-Graeco-Latin squares can also
be generated when you need to isolate the effects of more than one "nuisance"
variable.
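
To make the Latin square idea concrete, the sketch below builds a cyclic square in Python. It is one simple construction chosen for illustration, not the procedure SYSTAT uses, and in practice the rows, columns, and treatment labels would normally be randomized before use. The function name `latin_square` is hypothetical.

```python
def latin_square(n):
    """Cyclic n x n Latin square: cell (i, j) receives treatment (i + j) mod n."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]


square = latin_square(4)
for row in square:
    # Rows and columns stand for the two blocking factors; entries are treatments.
    print(" ".join("ABCD"[t] for t in row))
```

Each treatment appears exactly once in every row and once in every column, so the treatment comparison is balanced with respect to both blocking factors.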


Analysis of Factorial Designs 


Factorial designs are usually analyzed as linear models. The models available for a 
design depend on the number of factors and their levels and whether the design is full 
or fractional. 

The simplest models are main-effects models. A main-effects model is summarized
by the following equation:

$$y = \mu + \alpha_i + \beta_j + \dots + \varepsilon$$

where $y$ is the response variable and $\alpha_i, \beta_j, \dots$ represent the treatment effects of the
factors. This model assumes that all interactions are negligible. These models are
useful for describing very simple processes and for analyzing fractional designs of low
resolution. They are also useful for analyzing screening designs, where the goal is not
necessarily to model all effects realistically but merely to identify influential factors for
further study.

The next level of model complexity, the second-order model, involves adding two-factor
interaction terms to the equation. Following is an example for a two-factor
model:

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

This model allows you to explore joint effects of factors taken in pairs. For example,
the term $(\alpha\beta)_{ij}$ allows you to see whether the effect of the $\alpha$ factor on $y$ depends on
the level of $\beta$. If this term is significant, you can conclude that the effect of $\alpha$ does
indeed depend on the level of $\beta$.
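
The terms of the two-factor model can be read off a table of cell means. The Python sketch below splits a small table into a grand mean, two sets of main effects, and the interaction terms. The cell means are invented for illustration, and the function name `decompose` is hypothetical.

```python
# Hypothetical cell means: rows are levels of factor A, columns are levels of factor B.
cell_means = [[10.0, 20.0],
              [30.0, 60.0]]


def decompose(cells):
    """Split cell means into grand mean, main effects, and interaction terms."""
    a, b = len(cells), len(cells[0])
    mu = sum(map(sum, cells)) / (a * b)
    alpha = [sum(row) / b - mu for row in cells]
    beta = [sum(cells[i][j] for i in range(a)) / a - mu for j in range(b)]
    inter = [[cells[i][j] - mu - alpha[i] - beta[j] for j in range(b)]
             for i in range(a)]
    return mu, alpha, beta, inter


mu, alpha, beta, inter = decompose(cell_means)
print("grand mean:", mu)
print("A effects:", alpha)
print("B effects:", beta)
print("interaction:", inter)
```

The nonzero interaction terms show that the effect of B is larger at the second level of A (a difference of 30) than at the first (a difference of 10), which is exactly what the interaction term is meant to capture.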


Response Surface Designs 


There are many situations in which it is not enough to know simply which factors affect 
a process. You need to know exactly what combination of values for the factors 
produces the desired result. In other words, you want to optimize your process in terms 
of the outcome of interest. For example, you may want to find the best combination
of temperature and pressure for a chemical process, or the ideal soak time and
developer concentration for a photographic development process.
This is typically done by calculating a model of the response based on the factors of
interest. The shape of the surface is examined in order to identify the point of
maximum response (or minimum response for minimization problems). Such a model
is called a response surface, and experimental designs for finding such models are
called response surface designs.




In many cases, the response surface must be considered in parts because when you 
consider all possible values for the factors involved, the surface can be quite complex. 
Because of this complexity, it is often not possible to build a mathematical model that 
truly reflects the shape of the surface. Fortunately, restricted portions of the response 
surface can usually be modeled successfully with relatively simple equations. 


To take advantage of this, experimenters often use a two-stage approach to modeling
response surfaces. In the first stage, a "neighborhood" in the space defined by the
factors is chosen and a simple linear model is constructed. If the linear model fits the
data in that neighborhood, the model is used to find a direction of steepest ascent (or
descent for minimization problems). The factor limits that define the neighborhood are
then adjusted in the appropriate direction, defining a new neighborhood, and another
linear design is used. This continues until the linear model no longer fits the data in
that region. In the second stage, a more complex model is calculated, and an estimate
of the maximum (or minimum) response point can be found. Occasionally, the
surface is linear up to the boundary of your factor space, in which case you simply use
the linear model to choose the boundary point that maximizes your response.
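
The first stage can be sketched with a two-level factorial in coded units. For such an orthogonal design, each coefficient of the linear model is the average of the responses weighted by the corresponding column of signs, and the direction of steepest ascent is proportional to the coefficient vector. The responses in the Python fragment below are invented for illustration.

```python
from math import hypot

# 2^2 factorial in coded units (x1, x2) with hypothetical responses y.
runs = [(-1, -1, 50.0), (1, -1, 60.0), (-1, 1, 54.0), (1, 1, 64.0)]

n = len(runs)
b0 = sum(y for _, _, y in runs) / n        # intercept (mean response)
b1 = sum(x1 * y for x1, _, y in runs) / n  # coefficient of x1
b2 = sum(x2 * y for _, x2, y in runs) / n  # coefficient of x2

# Unit vector along the gradient of the fitted plane.
norm = hypot(b1, b2)
direction = (b1 / norm, b2 / norm)
print("fitted model: y =", b0, "+", b1, "* x1 +", b2, "* x2")
print("steepest ascent direction:", direction)
```

Here the fitted coefficients are 5 for x1 and 2 for x2, so the next neighborhood would be centered along a path that moves 2.5 coded units in x1 for every coded unit in x2.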


Variance of Estimates and Rotatability 


In most cases, the purpose of building a mathematical model of a process is to make
predictions about what would happen given a particular set of conditions that you have
not measured directly. This is particularly true in the case of a response surface
experiment: the surface you calculate is essentially a set of predictions for all possible
combinations within the limits of your factor measurements. With an adequate model
and careful measurements, you can do a reasonably good job of predicting response
throughout the response surface neighborhood of interest.

When you make such predictions, however, you must accept the fact that the model
is not perfect: there are often imperfections in your measurements, and the
mathematical model almost never fits the true response function exactly. Thus, if you
were to conduct the experiment repeatedly, you would get slightly different answers each
time. The degree to which your predictions are expected to differ across multiple
experiments is known as the variance of prediction, or $V(\hat{y})$. The value of $V(\hat{y})$
depends on the design used and on where in the factor space you are calculating a
prediction. $V(\hat{y})$ increases as you get farther from the observed data points. Of course,
you would like the portion of the design that produces the most precise predictions to
be near the optimum that you are trying to locate. Unfortunately, you usually do not
know where the optimal value is (or in what direction it lies) when you
start.
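
For a first-order model fitted to an orthogonal two-level design, the variance of prediction has a simple closed form. The Python sketch below computes $V(\hat{y})/\sigma^2 = x'(X'X)^{-1}x$ for a 2^2 factorial in coded units; because X'X is diagonal for this design, the inverse is taken element by element. The design and the evaluation points are illustrative choices.

```python
from math import cos, sin

design = [(-1, -1), (1, -1), (-1, 1), (1, 1)]   # 2^2 factorial, coded units
X = [(1.0, x1, x2) for x1, x2 in design]        # columns: intercept, x1, x2

# X'X is diagonal for this orthogonal design, so its inverse is elementwise.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
assert all(xtx[i][j] == 0 for i in range(3) for j in range(3) if i != j)
inv_diag = [1.0 / xtx[i][i] for i in range(3)]


def scaled_variance(x1, x2):
    """V(y-hat) / sigma^2 at the point (x1, x2) for the first-order model."""
    point = (1.0, x1, x2)
    return sum(p * p * d for p, d in zip(point, inv_diag))


print("center:", scaled_variance(0, 0))
print("design corner:", scaled_variance(1, 1))
print("distance 2, two directions:",
      scaled_variance(0, 2), scaled_variance(2 * cos(0.3), 2 * sin(0.3)))
```

The scaled variance is smallest at the center of the design and grows with the squared distance from it. For this design it is also the same at any two points that are equally far from the center.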

To deal with the fact that you do not know where exactly the optimum is, you can 
use designs in which the variance of prediction depends only on the distance from the 
center of the design, not on the direction from the center. Such designs are called 
rotatable designs. First-order (linear) orthogonal models are always rotatable. Some 
central composite designs are rotatable. (In SYSTAT, the distance from the center is 
automatically chosen to ensure rotatability for unblocked designs. However, for 
blocked designs, the distance is chosen to ensure orthogonality of blocks, which may 
lead to nonrotatable designs). In addition, some Box-Behnken designs are rotatable, 
and most are nearly rotatable (meaning that directional differences in prediction 


Response Surface Design Types 


Two types of response surface designs are available: 
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Box-Behnken. These are second-order designs (which allow estimation of quadratic 
effects) based on combining a two-level factorial design with an incomplete block 
design. For these designs, factors need to be measured at only three levels. Box- 
Behnken designs are also quite efficient in requiring relatively few runs. 


Analysis of Response Surface Designs 


Response surface designs are analyzed with either a linear or a quadratic model, 
depending on the purpose of the design. If the purpose is hill-climbing, a linear model 
is usually adequate. If the purpose is to locate the optimum, then a quadratic model is 
needed. 

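The linear model takes the form

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

where k is the number of factors in the design. Similarly, the quadratic model is
expressed as

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \beta_{11} x_1^2 + \dots + \beta_{kk} x_k^2 + \beta_{12} x_1 x_2 + \dots + \beta_{k-1,k} x_{k-1} x_k$$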


In either case, the estimated equation defines the response surface. This surface is often 
plotted, either as a 3-D surface plot or a 2-D contour plot, to help the investigator
visualize the shape of the response surface.

Analysis of response surface designs under the quadratic model is carried out
with the Analyze => Response Surface Methods feature, where parameters of the
surface are estimated and optimal settings of the factors are computed, along with
contour and other useful plots. See Statistics IV: Chapter 7: Response Surface
Methods.
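
For two factors, the optimal setting implied by a fitted quadratic model can be found by setting both partial derivatives to zero and solving the resulting pair of linear equations. The Python sketch below does this directly. The coefficients are invented for illustration, the function name `stationary_point` is hypothetical, and the fragment is not a description of how SYSTAT's Response Surface Methods feature computes its results.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Stationary point of y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2.

    Solves  2*b11*x1 + b12*x2 = -b1  and  b12*x1 + 2*b22*x2 = -b2  by Cramer's rule.
    """
    a11, a12, a22 = 2 * b11, b12, 2 * b22
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("no unique stationary point")
    x1 = (-b1 * a22 + a12 * b2) / det
    x2 = (-a11 * b2 + a12 * b1) / det
    # Classify the point from the second-derivative (Hessian) matrix.
    if det > 0:
        kind = "maximum" if a11 < 0 else "minimum"
    else:
        kind = "saddle"
    return x1, x2, kind


# Hypothetical fitted coefficients.
print(stationary_point(b1=8.0, b2=4.0, b11=-2.0, b22=-1.0, b12=1.0))
```

With these coefficients the stationary point is at x1 = 20/7 and x2 = 24/7 and is a maximum. A saddle result would indicate that the fitted surface has no single best point inside the region.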


Mixture Designs 


Suppose that you are trying to determine the best blend of ingredients or components
for your product. Initially, this appears to be a problem that can be addressed with a
straightforward response surface design. However, upon closer examination, you
discover that there is an additional consideration in this problem: the amounts of the
ingredients are inextricably linked with each other. For example, suppose that you are
trying to determine the best combination of pineapple and orange juices for a fruit
punch. Increasing the amount of orange juice means that the amount of pineapple juice
must be decreased, relative to the whole. (Of course, you could add more pineapple
juice as you add more orange juice, but this would simply increase the total amount of
punch. It would not alter the fundamental quality of the punch.) The problem is shown
in the following plot.


[Figure: Pineapple Juice (PJ) plotted against Orange Juice (OJ); the one-gallon blends lie on a straight line]


By specifying that the components are ingredients in a mixture, you limit the values
that the amounts of the components can take. All of the points corresponding to
one-gallon blends lie on the line shown in the plot. You can describe the constraint with the
equation OJ + PJ = 1 gallon.

Now, suppose that you decide to add a third type of juice, watermelon juice, to the
blend. Of course, you still want the total amount of juice to be one gallon, but with
three factors you have a bit more flexibility in the mixtures. For example, to increase
the amount of orange juice, you can decrease the amount of pineapple
juice, the amount of watermelon juice, or both. The constraint now becomes
OJ + PJ + WJ = 1 gallon. The combinations of juice amounts that satisfy this constraint
lie in a triangular plane within the unconstrained factor space.

The feasible values for a mixture comprise a (k - 1)-dimensional region within the
k-dimensional factor space (indicated by the shaded triangle). This region is called a
simplex. The pure mixtures (made of only one component) are at the corners of the
simplex, and binary mixtures (mixtures of only two components) are along the edges.
The concept of the mixture simplex extends to higher-dimensional problems as well:
the simplex for a four-component problem is a three-dimensional regular tetrahedron
and so on.

To generalize, you measure component amounts as proportions of the whole rather
than as absolute amounts. When you take this approach, it is clear that increasing the
proportion of one ingredient necessarily decreases the proportion(s) of one or more of
the others. There is a constraint that the sum of the ingredient proportions must equal
the whole. In the case of proportions, the whole would be denoted by 1.0, and the
constraint is expressed as

$$x_1 + x_2 + \dots + x_k = 1.0$$

where $x_1, \dots, x_k$ are the proportions of each of the k components in the mixture.
Because of this constraint, such problems require a special approach. This approach
includes using a special class of experimental designs, called mixture designs. These
designs take into account the fact that the component amounts must sum to 1.0.


Unconstrained Mixture Designs 


Unconstrained mixture designs allow factor levels to vary from the minimum to the
maximum value for the mixture. Four unconstrained designs are available. See Cornell
(1990) for more information on each.


Lattice. Lattice designs allow you to specify the number of levels or the number of
values that each component (factor) assumes, including 0 and 1. The selection of levels
has no effect for the other three types of designs available because the number of
factors determines the number of levels for each of them. As Cornell (1990) points out,
the vast majority of mixture research employs lattice models; however, the other three
types included here are useful in specific situations.


Centroid. Centroid designs consist of every (non-empty) subset of the components, but 
only with mixtures in which the components appear in equal proportions. Thus, if we 
asked for a centroid design with four factors (components), the mixtures in the model 
would consist of all permutations of the set (1, 0, 0, 0), all permutations of the set (1/2, 
1/2, 0, 0), all permutations of the set (1/3, 1/3, 1/3, 0), and the set (1/4, 1/4, 1/4, 1/4). 
Thus, the number of distinct points is 1 less than 2 raised to the power of the number of
components (here, 2^4 - 1 = 15). Centroid designs are useful for investigating mixtures where
incomplete mixtures (with at least one component absent) are of primary importance.
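
The centroid rule above can be sketched in a few lines of Python. This is only an illustration of the point set described in the text, not SYSTAT's implementation; the function name `centroid_design` is hypothetical, and the order of the points is arbitrary.

```python
from fractions import Fraction
from itertools import combinations

def centroid_design(q):
    """Return the centroid-design blends for q components as tuples of Fractions."""
    points = []
    for size in range(1, q + 1):
        # Every non-empty subset of components, blended in equal proportions.
        for subset in combinations(range(q), size):
            share = Fraction(1, size)
            points.append(tuple(share if i in subset else Fraction(0) for i in range(q)))
    return points

design = centroid_design(4)
for blend in design:
    print(" ".join(str(x) for x in blend))
print(len(design), "blends")
```

Exact fractions are used so that every blend sums to exactly 1 and no rounding hides a duplicate point.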


Axial. In an axial design with m components, each run consists of at least (m - 1) equal 
proportions of the components. These designs include: mixtures composed of one 
component; mixtures composed of (m - 1) components in equal proportions; and 
mixtures with equal proportions of all components. Thus, if we asked for an axial 
design with four factors (components), the mixtures in the model would consist of all 
permutations of the set (1,0,0,0), all permutations of the set (5/8, 1/8, 1/8, 1/8), all 
permutations of the set (0, 1/3, 1/3, 1/3), and the set (1/4, 1/4, 1/4, 1/4). 


Screen. Screening designs are reduced axial designs, omitting the mixtures that contain
all but one of the components. Thus, if we asked for a screening design with four factors
(components), the mixtures in the model would consist of all permutations of the set 
(1, 0, 0, 0), all permutations of the set (5/8, 1/8, 1/8, 1/8), and the set (1/4, 1/4, 1/4, 1/4). 
Screening designs enable you to single out unimportant components from an array of 
many potential components. 
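
The axial and screening point sets can be sketched together in Python. The sketch assumes that the interior axial blend lies halfway between the overall centroid and a pure component, which reproduces the (5/8, 1/8, 1/8, 1/8) blend of the four-component example; that generalization and the function name `axial_design` are assumptions made for illustration, not a description of SYSTAT's code.

```python
from fractions import Fraction

def axial_design(m, screening=False):
    """Blends for an axial design with m components; screening=True drops the
    blends that contain all but one component."""
    zero, one = Fraction(0), Fraction(1)
    points = []
    # Pure components: permutations of (1, 0, ..., 0).
    for i in range(m):
        points.append(tuple(one if j == i else zero for j in range(m)))
    # Interior axial blends, assumed halfway between the centroid and a vertex.
    big, small = Fraction(m + 1, 2 * m), Fraction(1, 2 * m)
    for i in range(m):
        points.append(tuple(big if j == i else small for j in range(m)))
    # Blends of (m - 1) components in equal proportions (omitted when screening).
    if not screening:
        share = Fraction(1, m - 1)
        for i in range(m):
            points.append(tuple(zero if j == i else share for j in range(m)))
    # Overall centroid: all components in equal proportions.
    points.append(tuple(Fraction(1, m) for _ in range(m)))
    return points

axial = axial_design(4)
screen = axial_design(4, screening=True)
print(len(axial), len(screen))
```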


Constrained Mixture Designs 


You can also consider mixture problems with additional constraints on the mixture 
values. For example, suppose that orange juice is much cheaper than other kinds of 
juice, and you therefore decide that your punch must contain at least 50% orange juice. 
However, you also want to make sure that your punch is sufficiently distinct from pure 
orange juice, so you place another restriction—that orange juice can make up no more 
than 75% of the punch. These criteria place additional constraints on your mixture, 
specifically 0.5 <= OJ <= 0.75. This restricts the range of feasible solutions in the
simplex, as shown below by the outlined area.


Analysis of Mixture Designs


In mixture experiments, you are usually trying to find an optimal mixture, according 
to some criterion. In this sense, mixture models are related to response surface models. 
However, the constraint on the sum of the component values takes away one degree of 
freedom from the model. This can be accommodated by reparameterizing the linear 
model so that there is no intercept term. (This is also known as a Scheffé model.) Thus, 


the linear model is specified as

$$\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

and the quadratic form is

$$\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \beta_{12} x_1 x_2 + \cdots + \beta_{(k-1)k} x_{k-1} x_k$$

for mixtures with k components. Notice that the quadratic form does not include
squared terms. Such terms would be redundant, since the square of a component can
be re-expressed as a function of the linear and cross-product terms. For example,

$$x_1^2 = x_1 (1 - x_2 - \cdots - x_k) = x_1 - x_1 x_2 - \cdots - x_1 x_k$$

The model is estimated using standard general linear modeling techniques. The 
parameters can be tested (with a sufficient number of observations), and they can be 
used to define the response function. The plot of this function can give visual insights
into the process under investigation and allow you to select the optimal combination
of components for your mixture.
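
As a rough illustration of estimating the quadratic mixture model without an intercept, the following Python sketch builds the model matrix (linear terms plus cross products) and fits it by ordinary least squares with NumPy. The coefficients in `true_beta` and the noise-free response are made up purely for demonstration, and the function name `scheffe_quadratic_matrix` is hypothetical.

```python
from itertools import combinations
import numpy as np

def scheffe_quadratic_matrix(x):
    """Columns x_1..x_k followed by every cross product x_i * x_j (i < j); no intercept."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    cross = [x[:, i] * x[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack([x] + cross)

# Three components with proportions in steps of 1/4 (15 blends).
blends = np.array([(i, j, 4 - i - j) for i in range(5) for j in range(5 - i)]) / 4.0

# Made-up coefficients: b1, b2, b3, b12, b13, b23 (illustration only).
true_beta = np.array([10.0, 8.0, 6.0, 4.0, -2.0, 0.0])

X = scheffe_quadratic_matrix(blends)
y = X @ true_beta                      # synthetic, noise-free response
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

Because squared terms and the intercept are left out, the six columns are linearly independent on the simplex and the fit recovers the six coefficients.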


Optimal Designs 


In going through the process of designing experiments, you might ask yourself, “What 
is the advantage of a designed experiment over a more haphazard approach to 
collecting data?” The answer is that a carefully designed experiment will allow you to 
estimate the specific model you have in mind for the process, and it will allow you to 
do so efficiently. Efficiently in this context means that the model can be estimated with 
high (or at least adequate) precision, with a manageable number of runs. 

Through the years, statisticians have identified useful classes of research problems 
and developed efficient experimental designs for each. Such classes of problems 
include identifying important effects within a set of two-level factors (Box-Hunter 
designs), optimizing a process using a quadratic surface (central composite or Box- 
Behnken designs), or optimizing a mixture process (mixture designs). 

One of the standard designs may be appropriate for your research needs. 
Sometimes, however, your research problem does not quite fit into the mold of these 
standard designs. Perhaps you have specific ideas about which terms you want to 
include in your model, or perhaps you cannot afford the number of runs called for in 
the standard design. The standard design’s efficiency is based on assumptions about 
the model to be estimated, the number of runs to be collected, and so on. When you try 
to run experiments that violate these assumptions, you lose some of the efficiency of 
the design. 

You may now be asking yourself, “Well, then, how do I find a design for my 
idiosyncratic experiment? Is there a way that I can specify exactly what I want and get 
an efficient design to test it?” The answer is yes—this is where the techniques of 
optimal experimental design (often abbreviated to optimal design) come in. Optimal 
design methods allow you to specify your model exactly (including number of runs) 
and to choose a criterion for measuring efficiency. The design problem is then solved 
by mathematical programming to find a design that maximizes the efficiency of the
design, given your specifications. The use of the word optimal to describe designs
generated in this manner means that we are optimizing the design for maximum
efficiency relative to the desired efficiency criterion.


Optimization Methods


First, you need to choose an optimization method. Different mathematical methods 
(algorithms) are available for finding the design that optimizes the efficiency criterion. 
Some of these methods require a candidate set of design points from which to choose 
the points for the optimal design. Other methods do not require such a candidate set. 


Three optimization methods are available: 


Fedorov method. This method requires a predefined candidate set. It starts with an 
initial design, and at each step it identifies a pair of points—one from the design and 
one from the candidate set—to be exchanged. That is, the candidate point replaces the 
selected design point to form a new design. The pair exchanged is the pair that shows 
the greatest reduction in the optimality criterion when exchanged. This process repeats 
until the algorithm converges. 


k-exchange method. This method starts with a set of candidate points and an initial 
design and exchanges the worst k points at each iteration in order to minimize the 
objective function. Candidate points must come from a previously generated design. 


Coordinate exchange method. This method does not require a candidate set. It starts 
with an initial design based on a random starting point. At each iteration, k design 
points are identified for exchange, and the coordinates of these points are adjusted one 
by one to minimize the objective function. The fact that this method does not require a 
candidate set makes it useful for problems with large factor spaces. Another advantage 
of this method is that one can use either continuous or categorical variables, or a 
mixture of both, in the model. 

For the methods that require a candidate set, that set must be defined before you
generate your optimal design. The set of points must be in a file that was generated and 
saved by the Design Wizard. You may eliminate undesirable rows before using the file 
in the Fedorov or k-exchange method to generate an optimal design based on the 
candidate design. The same requirements hold for any so-called starting design in a file 
that is submitted by the user. 

It is important to remember that these methods are iterative, based on starting
designs with a random component to them. Therefore, they will not always converge
on a design that is absolutely optimal; they may fall into a local minimum or saddle
point, or they may simply fail to converge within the allowed number of iterations.
That is why each method allows you to generate a design multiple times based on
different starting designs.
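
The exchange idea and the value of several random starts can be sketched in Python. This is a simplified Fedorov-style exchange written for illustration, not SYSTAT's algorithm: the candidate set (a 2 x 2 x 2 factorial coded 0/1), the model (A, B, C, AB, AC plus an intercept), the small `RIDGE` term that keeps the criterion usable when a starting design is singular, and all function names are assumptions of this sketch.

```python
import itertools
import numpy as np

RIDGE = 1e-8  # assumption of this sketch: avoids a flat criterion for singular designs

def model_matrix(points):
    """Columns for the intercept, A, B, C, AB, and AC."""
    p = np.asarray(points, dtype=float)
    a, b, c = p[:, 0], p[:, 1], p[:, 2]
    return np.column_stack([np.ones(len(p)), a, b, c, a * b, a * c])

def information(points, ridge=RIDGE):
    """det(X'X + ridge*I); making it larger corresponds to making |(X'X)^-1| smaller."""
    x = model_matrix(points)
    return np.linalg.det(x.T @ x + ridge * np.eye(x.shape[1]))

def exchange(candidates, n_runs, rng, max_iter=100):
    """Repeatedly apply the single (design point, candidate) swap that helps most."""
    design = [candidates[i] for i in rng.integers(len(candidates), size=n_runs)]
    best = information(design)
    for _ in range(max_iter):
        best_swap = None
        for i in range(n_runs):
            for cand in candidates:
                value = information(design[:i] + [cand] + design[i + 1:])
                if value > best * (1 + 1e-9):
                    best, best_swap = value, (i, cand)
        if best_swap is None:          # no swap improves the criterion: stop
            break
        design[best_swap[0]] = best_swap[1]
    return design, best

candidates = list(itertools.product([0, 1], repeat=3))
# Several random starting designs; keep the best result.
results = [exchange(candidates, 6, np.random.default_rng(seed)) for seed in range(5)]
design, _ = max(results, key=lambda r: r[1])
x = model_matrix(design)
print(design)
print(np.linalg.det(x.T @ x))
```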




Efficiency Criteria for Optimal Designs 


You may have noticed that no explicit mathematical definition of efficiency was given 
in the discussion above. This is because there are several different ways of defining and 
measuring efficiency of designs. Because the object of optimal design is to minimize 
a specific efficiency criterion, the values used to measure efficiency are also called 
optimality criteria in this context. You can choose from three optimality criteria: 


D-optimality. This criterion measures the generalized variance of the parameter 
estimates in the model. The generalized variance is the determinant of the parameter 
dispersion matrix: D = |(X'X)^{-1}|, where X is the design matrix. The square root of
this value is proportional to the volume of the confidence ellipsoid about the parameter 
estimates. The design is generated to minimize D (D stands for determinant). 


A-optimality. This criterion measures the average (or, equivalently, the sum) of the 
variances for the parameter estimates. Minimizing this criterion, measured as the trace 
of the parameter dispersion matrix, A = trace[(X'X)^{-1}], yields the design with the
smallest average variance for the parameter estimates. The design is generated to 
minimize A (A stands for average). 


G-optimality. This criterion focuses on the variance of predicted response values rather 
than the variance of the parameter estimates. The variance of predictions varies across 
the factor space (that is, as different levels of the factors are examined). This criterion 
specifically measures the maximum variance of prediction within the factor space, and 
seeks to minimize this maximum value, G = max v(x), where v(x) is the variance of
the prediction at design point x (G stands for global).


In most circumstances, these criteria will give similar results. Using G-optimality can
take more time to compute, since each iteration involves both maximization and 
minimization. In many situations, D-optimality will be a good choice because it is fast 
and invariant to linear transformations. A-optimality is especially sensitive to the scale 
of continuous factors, such that a design with factors having very different scales may 
lead to problems while generating a design.
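
To make the three criteria concrete, the Python sketch below computes D, A, and G for a given model matrix with NumPy. Variances are expressed in units of the error variance, G is evaluated over a supplied set of points rather than the whole factor space, and the function name `optimality_criteria` is hypothetical.

```python
import itertools
import numpy as np

def optimality_criteria(X, region):
    """Return (D, A, G) for model matrix X.

    D = |(X'X)^-1|, A = trace[(X'X)^-1], and G is the largest prediction
    variance x'(X'X)^-1 x over the rows of `region`.
    """
    dispersion = np.linalg.inv(X.T @ X)
    d_value = float(np.linalg.det(dispersion))
    a_value = float(np.trace(dispersion))
    g_value = max(float(row @ dispersion @ row) for row in region)
    return d_value, a_value, g_value

# Full 2 x 2 x 2 factorial coded -1/+1, main-effects model with an intercept.
levels = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
X = np.column_stack([np.ones(len(levels)), levels])
d, a, g = optimality_criteria(X, X)
print(d, a, g)
```

For this orthogonal design X'X is 8 times the identity matrix, so the dispersion matrix is diagonal and all three values can be checked by hand.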


Analysis of Optimal Designs 


Analysis of optimal designs closely parallels the analysis of other experimental
designs. The general linear model (GLM) is used to build an equation for the model and
estimate and test effects.

There is one important difference. For an optimal experiment, you specify the
model for the experiment before you generate the design. This is necessary to ensure 
that the design is optimized for your particular model, rather than an assumed model 
(such as a complete factorial or a full quadratic model). This means that for optimal 
designs, the form of the equation to be estimated is an integral part of the experimental 
design. 

Let’s consider a simple example: suppose that you have three two-level factors (call
them A, B, and C), and you want to perform tests of the following effects: A, B, C, AB,
and AC. You could use the usual 2^3 factorial design, which would give you the
following runs:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1


Now, suppose that you want to estimate the model in only six runs. There is no standard 
design for this, so you must use an optimal design. Using the coordinate exchange 
method with the D-optimality criterion yields the following design: 


A B C
1 1 0 
0 0 0 
1 1 1 
1 0 0 
0 0 1 
0 1 0 

However, if we change the form of the model slightly, so that we are asking for the A,
B, C, AB, and BC effects, we get a slightly different design:


[Table: six-run design with columns A, B, and C]


In general, a design generated based on one model will not be a good design for a
different model. The implication of this is that the model used to generate the design 
places limits on the model that you estimate in your data analysis. In most cases, the 
two models will be the same, although you may sometimes want to omit terms from 
the analysis that were in the original model used to generate the design. 


Choosing a Design 


Deciding which design to use is an important part of the experimental design process. 
The answer will depend on various aspects of the research problem at hand, such as: 


■ What type of knowledge do you want to gain from the research?
■ How much background knowledge can you bring to bear on the question?
■ How many factors are involved in the process of interest?
■ How many different ways do you want to measure the outcome?
■ What are the constraints, if any, on your factors?
■ What is your criterion for the “best design”?
■ Will you have to use the results of the experiment to convince others of your
  conclusions? What will they find convincing?
■ What are the constraints on your research process in terms of time, money, human
  resources, and so forth?


Defining the Question


Successful research depends on how well the problem is formulated. It does no good 
to run an elaborate, highly efficient experiment if it gives you the answer to the wrong 
question. Spend some time and effort carefully considering your problem. Doing so 
will help to ensure that your experimental design will give you the information you 
need to solve your problem. 


Identifying Candidates for Factors and Responses 


In most cases, it is most efficient to focus on only the important factors and ignore the 
inconsequential ones. However, you should not be too eager to eliminate factors from 
your experiment. Leaving out even one crucial factor can seriously hinder your ability 
to find true effects and can lead to highly misleading results. If there is any doubt about 
a factor, it is usually best to include it. Once you have empirical confirmation that its 
effect is really negligible, you can delete it from subsequent models. 

If there is not much background knowledge available to help in your factor
selection, you should consider employing a screening design. These designs allow
you to test for main effects with a small number of runs. Such designs allow you to
examine a large number of candidate factors without exhausting your resources. Once
you have identified a set of interesting factors, you can move on to a fuller design to
test for more complex effects.


Setting Priorities 


Consider what is really important in your study. Do you need the highest precision 
possible, regardless of what it takes? Or are you more concerned about controlling 
costs, even if it means settling for an approximate model? Would the cost of 
overlooking an effect be greater than the cost of including the effect in your model? 
Giving careful thought to questions like these will help you choose a design that 
satisfies your criteria and helps you accomplish your research goals. 
Design of Experiments in SYSTAT


Design of Experiments Wizard 


To access the Design of Experiments Wizard, from the menus choose: 
Utilities 
Design of Experiments 
Wizard... 


[Dialog box: Utilities: Design of Experiments: Wizard, with design types grouped under Factorial designs, Response surface designs, and Additional designs]


The Design of Experiments Wizard offers nine different design types: General
factorial, Box-Hunter, Latin square, Taguchi, Plackett-Burman, Box-Behnken, Central
composite, Optimal, and Mixture model. After you select a design type, a series of
dialogs prompts for design specifications before generating a final design matrix.
These specifications typically include the number of factors involved, as well as the
number of levels for each factor.


Replications. For any design created by the Design Wizard, replications can be saved
to a file. By default, SYSTAT saves the design without replications. If you request n
copies of a design, the complete design will be repeated n times in the saved file (global
replication). If local replications are desired, simply sort the saved file on the variable
named RUN to group replications by run number. Replications do not appear on the
output screen.

Note: It is not necessary to have a data file open to use Design of Experiments.


Classic Design of Experiments 


To access the classic Design of Experiments dialog box, from the menus choose: 


Utilities 
Design of Experiments 
Classic... 


Classic DOE offers a subset of the designs available using the Design Wizard,
including Factorial, Box-Hunter, Latin square, Taguchi, Plackett, Box-Behnken, and
Mixture designs. In contrast to the Wizard, Classic DOE uses a single dialog to define
all design settings. The following options are available:

Levels. For factorial, Latin, and mixture designs, this is the number of levels for the
factors. Factorial designs are limited to either two or three levels per factor.

Factors. For factorial, Box-Hunter, Box-Behnken, and lattice mixture designs, this is
the number of factors, or independent variables.

Runs. For Plackett and Box-Hunter designs, this is the number of runs.

Replications. For all designs except Box-Behnken and mixture, this is the number of
replications.

Mixture type. For mixture designs, you can specify a mixture type from the drop-down
list. Select Centroid, Lattice, Axial, or Screen.


Taguchi type. For Taguchi designs, you can select a Taguchi type from the drop-down 
list. 


Save design. This option saves the design to a file. 

Print options. The following two options are available:

■ Use letters for labels. Labels the design factors with letters instead of numbers.
■ Print Latin square. For Latin square designs, you can print the Latin square.

Design options. The following two options are available:

■ Randomize. Randomizes the order of experimentation.
■ Include blocking factor. For Box-Behnken designs, you can include a blocking
  factor.


Using Commands 


With commands: 


DESIGN 

SAVE filename or WORK filename 

FACTORIAL / FACTORS=n REPS=n LETTERS RAND, LEVELS=2 or 3 

BOXHUNTER / FACTORS=n RUNS=n REPS=n LETTERS RAND 

LATIN / LEVELS=n SQUARE REPS=n LETTERS RAND 

TAGUCHI / TYPE=design REPS=n LETTERS RAND 

PLACKETT / RUNS=n REPS=n LETTERS RAND 

BOXBEHNKEN / FACTORS=n BLOCK LETTERS RAND REPS=n 

MIXTURE / TYPE=LATTICE or CENTROID or AXIAL or SCREEN,
FACTORS=n LEVELS=n RAND LETTERS REPS=n 


Note: Some designs generated by the Design Wizard cannot be created using
commands.


Usage Considerations 


Types of data. No data file is needed to use Design of Experiments.

Print options. For Box-Hunter designs, using PLENGTH LONG in Classic DOE yields
a listing of the generators (confounded effects) for the design. For Taguchi designs, a
table defining the interaction is available.

Quick Graphs. No Quick Graphs are produced.

Saving files. The design can be saved to a file.

BY groups. Analysis by groups is not available.

Case weights. Case weighting is not available in Design of Experiments.


Examples 


Example 1 
Full Factorial Designs 


The DOE Wizard input for a (2 x 2 x 2) design is: 


Wizard Prompt Response 

Design type General Factorial 
Choose a type of design: Full Factorial Design 
Divide the design into incomplete blocks? No 

Enter the number of factors desired: 3 

Is the number of levels to be the same for all factors? Yes 

Enter number of levels: 2 

Display the factors for this design? Yes 

Save the design to a file? No 


The output is: 


Factorial Design: 3 Factors, 8 Runs

       Factor
Run |  A  B  C
----+---------
  1 |  0  0  0
  2 |  0  0  1
  3 |  0  1  0
  4 |  0  1  1
  5 |  1  0  0
  6 |  1  0  1
  7 |  1  1  0
  8 |  1  1  1

To generate this design using commands, the input is: 


DESIGN 
FACTORIAL / FACTORS = 3 LEVELS = 2 
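
For readers who want to see how such a run list arises, here is a minimal Python sketch that enumerates a full factorial in the same order as the listing above (last factor changing fastest). It is independent of SYSTAT; the function name `full_factorial` is hypothetical.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """Every combination of factor levels, with the last factor changing fastest."""
    return list(product(*(range(n) for n in levels_per_factor)))

runs = full_factorial([2, 2, 2])
for number, (a, b, c) in enumerate(runs, start=1):
    print(number, a, b, c)
```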


Example 2
Fractional Factorial Design 


The DOE Wizard input for a (2 x 2 x 2 x 2) fractional factorial design in which the
two-way interactions A*B and A*C must be estimable is:

Wizard Prompt                                               Response

Design type                                                 General Factorial
Choose a type of design:                                    Fractional Factorial Design
Divide the design into incomplete blocks?                   No
Enter the number of factors desired:                        4
Is the number of levels to be the same for all factors?     Yes
Enter number of levels:                                     2
Please choose:                                              Automatically find the smallest
                                                            design consistent with my criteria
Choose a Search Criterion                                   Require that specific effects be
                                                            estimable
May main effects be confounded with 2-factor
interactions?                                               Yes
Are there any specific effects to be estimated other than
the effects already cited?                                  Yes
List them by using the appropriate factor letters separated
by asterisks for interactions.                              A*B
                                                            A*C
Are there any effects that are not to be estimated, but yet
should not be confounded with effects that are to be
estimated?                                                  Yes
List them by using the appropriate factor letters separated
by asterisks for interactions.                              A*D
Display the factors for this design?
Save the design to a file?
Display another fraction of this design?
Find another design with same parameters?


The output is:

Complete Defining Relation
Identity:
B * C * D

The design resolution is 3

Design Generators
Identity:
B * C * D


Fractional Factorial Design: 4 Factors, 8 Runs


[Table: design matrix, 8 runs by 4 factors (A, B, C, D)]


SYSTAT assumes that the main effects of any design should always be estimated.
Notice, however, that the defining relation avoids confounding the interaction of A
with any of the other factors, as requested by specifying the effects to be estimated
(A*B, A*C) and effects that should not be confounded even though they are not to be
estimated (A*D).


Example 3 
Box-Hunter Fractional Factorial Design 


To generate a (2 x 2 x 2) Box-Hunter fractional factorial, the input is:


Wizard Prompt                                             Response
Design type                                               Box-Hunter
Enter the number of factors desired:                      3
Enter the total number of cells for the entire design:    4
Display the factors for this design?                      Yes
Save the design to a file?                                No
Display another fraction for this design?                 No


The output is: 

Complete Defining Relation
Identity:
A * B * C


The design resolution is 3

Aliases


Design Generators
Identity:
A * B * C


Box-Hunter Design: 4 Runs, 3 Factors

       Factor
Run |  A   B   C
----+------------
  1 | -1  -1   1
  2 | -1   1  -1
  3 |  1  -1  -1
  4 |  1   1   1


To generate this design using commands, the input is: 


DESIGN 
BOXHUNTER / FACTORS = 3 


For 7 two-level factors, the number of cells (runs) for a complete factorial is 2^7 = 128.
The following example shows the smallest fractional factorial for estimating main 
effects. The design codes for the first three factors generate the last four. 


The input is: 
Wizard Prompt Response 
Design type Box-Hunter 
Enter the number of factors desired: 7 
Enter the total number of cells for the entire design: 8 
Display the factors for this design? Yes 
Save the design to a file? No 
Display another fraction for this design? No 
The output is: 
Complete Defining Relation
Identity:
A * B * D
A * C * E
B * C * D * E
B * C * F
A * C * D * F
A * B * E * F
D * E * F
A * B * C * G
C * D * G
B * E * G
A * D * E * G
A * F * G
B * D * F * G
C * E * F * G
A * B * C * D * E * F * G

The design resolution is 3

Design Generators
Identity:
A * B * D
A * C * E
B * C * F
A * B * C * G



Fractional Factorial Design: 7 Factors, 8 Runs 


[Table: design matrix, 8 runs by 7 factors (A through G)]


The main effect for factor D is confounded with the interaction between factors A and
B; the main effect for factor E is confounded with the interaction between factors A and
C; and so on.
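
The way generators build a fraction can be sketched in Python: take a full factorial in the base factors and define each additional factor as a product of base-factor columns. The sketch assumes -1/+1 coding and positive generators (C = AB for the three-factor case; D = AB, E = AC, F = BC, G = ABC for the seven-factor case); the run order and signs SYSTAT prints may differ, and the function name `fractional_factorial` is hypothetical.

```python
from itertools import product

def fractional_factorial(n_base, generators):
    """Full factorial in n_base factors coded -1/+1, plus one extra column per generator.

    Each generator is a tuple of base-factor indices whose product defines the new factor.
    """
    runs = []
    for base in product([-1, 1], repeat=n_base):
        extra = []
        for gen in generators:
            value = 1
            for idx in gen:
                value *= base[idx]
            extra.append(value)
        runs.append(base + tuple(extra))
    return runs

half = fractional_factorial(2, [(0, 1)])                                   # C = AB
eighth_runs = fractional_factorial(3, [(0, 1), (0, 2), (1, 2), (0, 1, 2)])  # D, E, F, G

for run in half:
    print(run)
for run in eighth_runs:
    print(run)
```

Because D is computed as the product of A and B, the D column is identical to the A*B interaction column, which is exactly the confounding described above.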


Example 4 
Latin Squares 


To generate a Latin square when each factor has four levels, enter the following DOE 
Wizard responses: 


Wizard Prompt Response 

Design type Latin Square 

The types available are: Ordinary Latin Square 
Number of levels: 4 

Randomize the design? No 

Display the square? Yes 


Display the factors for this design? Yes
Save the design to a file? No

The output is:


Latin Square: 4 Levels 


Latin Square Design: 4 Levels

       Factor
Run |  A  B  C
----+---------
  1 |  0  0  0
  2 |  0  1  1
  3 |  0  2  2
  4 |  0  3  3
  5 |  1  0  1
  6 |  1  1  2
  7 |  1  2  3
  8 |  1  3  0
  9 |  2  0  2
 10 |  2  1  3
 11 |  2  2  0
 12 |  2  3  1
 13 |  3  0  3
 14 |  3  1  0
 15 |  3  2  1
 16 |  3  3  2


To generate this design using commands, enter the following: 
DESIGN 
LATIN / LEVELS = 4 SQUARE LETTERS 


Omitting SQUARE prevents the Latin square from appearing in the output. 
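
The unrandomized design listed above follows a cyclic pattern: the level of the third factor is the sum of the first two, modulo the number of levels. A small Python sketch of that pattern follows; it is an illustration only, and the function names are hypothetical.

```python
def cyclic_latin_square(n):
    """n x n square in which cell (row, col) holds (row + col) mod n."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

def latin_square_runs(n):
    """One run per cell: (row factor A, column factor B, treatment factor C)."""
    square = cyclic_latin_square(n)
    return [(row, col, square[row][col]) for row in range(n) for col in range(n)]

square = cyclic_latin_square(4)
runs = latin_square_runs(4)
for line in square:
    print(" ".join(str(v) for v in line))
for number, run in enumerate(runs, start=1):
    print(number, *run)
```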


Permutations 


To randomly assign the factors to the cells, the input is: 


Wizard Prompt Response 

Design type Latin Square 

The types available are: Ordinary Latin Square 
Number of levels: 4 

Randomize the design? Yes 

Display the square? Yes 


Display the factors for this design? No 
Save the design to a file? No 

The output is:


Latin Square: 4 Levels 


[Table: randomized 4 x 4 Latin square]


Using commands: 


DESIGN 
LATIN / LEVELS=4 SQUARE LETTERS RAND


Example 5 
Taguchi Design 
To obtain a Taguchi L12 design with 11 factors, the DOE Wizard input is:

Wizard Prompt                             Response
Design type                               Taguchi
Taguchi design type:                      L12
Display the factors for this design?      Yes
Save the design to a file?                No
Display confounding matrix?               No

The output is:

Taguchi L12 Design (12 Runs, 11 Factors, 2 Levels Each)
[Table: design matrix, 12 runs by 11 two-level factors (A through K)]


To generate this design using commands, enter the following:


DESIGN
TAGUCHI / TYPE = L12


To obtain a Taguchi L16 design with 15 factors, the input is:

Wizard Prompt                             Response
Design type                               Taguchi
Taguchi design type:                      L16
Display the factors for this design?      Yes
Save the design to a file?                No
Display confounding matrix?               Yes

The output is:

Taguchi L16 Design (16 Runs, 15 Factors, 2 Levels Each)

Design L16 with 15 Two-Level Factors Plus Aliases

[Table: design matrix, 16 runs by 15 two-level factors]

Confoundings for each Pairwise Interaction
(Note that partial confoundings do not appear.)

[Table: matrix of confoundings, with one row and one column per factor]

This design can also be generated by the following commands:


DESIGN 
PLENGTH LONG 
TAGUCHI / TYPE = L16 


The matrix of confoundings identifies the factor pattern associated with the interaction 
between the row and column factors. For example, the factor pattern for the interaction 
between factors 6 and 8 is identical to the pattern for factor 14 (N).


Example 6 
Plackett-Burman Design 


To generate a Plackett-Burman design consisting of 11 two-level factors, the DOE 
Wizard input is: 


Wizard Prompt                             Response
Design type                               Plackett-Burman
Number of levels in design:               2
Runs per replication                      12
Display the factors for this design?      Yes
Save the design to a file?                No

The output is:

Plackett-Burman Design: 12 Runs, 11 Factors
[Table: design matrix, 12 runs by 11 two-level factors (A through K)]


To generate this design using commands, the input is: 


DESIGN 
PLACKETT / RUNS = 12


Example 7
Box-Behnken Design 


Each factor in this example has three levels. 


The input is: 


Wizard Prompt Response 


Design type Box-Behnken 
Number of factors 3 

Display the factors for this design? Yes 

Save the design to a file? No 


The output is: 


Box-Behnken Design: 3 Factors, 15 Runs 


[Table: design matrix, 15 runs by 3 factors (A, B, C), each at three levels]


To generate this design using commands, the input is: 


DESIGN 
BOXBEHNKEN / FACTORS = 3


Example 8
Mixture Design 


We illustrate a lattice mixture design in which each of the three factors has five levels; 
that is, each component of the mixture is 0%, 25%, 50%, 75%, or 100% of the mixture 
for a given run, subject to the restriction that the sum of the percentages is 100. To 
generate the design for this situation using the DOE Wizard, enter the following 
responses at the corresponding prompt: 


Wizard Prompt                                            Response
Design type                                              Mixture Model
Are there to be constraints for any component(s)?        No
The possible kinds of unconstrained design are:          Lattice
Enter the number of mixture components:                  3
Enter the number of levels for each component:           5
Display the factors for this design?                     Yes
Save the design to a file?                               No


The resulting mixture design follows: 


Lattice Design: 3 Factors, 15 Runs, 5 Levels 


      |           Factor
  Run |      A       B       C
------+------------------------
    1 |  1.000   0.000   0.000
    2 |  0.000   1.000   0.000
    3 |  0.000   0.000   1.000
    4 |  0.750   0.250   0.000
    5 |  0.750   0.000   0.250
    6 |  0.000   0.750   0.250
    7 |  0.500   0.500   0.000
    8 |  0.500   0.000   0.500
    9 |  0.000   0.500   0.500
   10 |  0.250   0.750   0.000
   11 |  0.250   0.000   0.750
   12 |  0.000   0.250   0.750
   13 |  0.500   0.250   0.250
   14 |  0.250   0.500   0.250
   15 |  0.250   0.250   0.500


To generate this design using commands, the input is:


DESIGN
MIXTURE / TYPE = LATTICE FACTORS = 3,
          LEVELS = 5
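

The lattice itself is easy to enumerate. The Python sketch below (an illustration, not SYSTAT code; the function name is hypothetical) lists every combination of the five levels 0, 1/4, 1/2, 3/4, and 1 whose components sum to 1. Exact fractions are used so that the sum restriction can be checked without rounding error.

```python
from fractions import Fraction
from itertools import product


def simplex_lattice(n_components=3, n_levels=5):
    """Enumerate the lattice points of a mixture design.

    Each component takes the values 0, 1/m, ..., 1 with m = n_levels - 1,
    and only points whose components sum to 1 are kept.
    """
    m = n_levels - 1
    points = []
    for counts in product(range(m + 1), repeat=n_components):
        if sum(counts) == m:
            points.append(tuple(Fraction(c, m) for c in counts))
    return points


if __name__ == "__main__":
    for point in simplex_lattice():
        print("  ".join(f"{float(x):.3f}" for x in point))
```

With three components and five levels this yields 15 points, the same set of mixtures as in the table above, although in a different order.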


After collecting your data, you may want to display it in a triangular scatterplot. 


Example 9
Mixture Design with Constraints


This example is adapted from an experiment reported in Cornell (1990, p. 265). The 
problem concerns the mixture of three plasticizers in the production of vinyl for car 
seats. We know that the combination of plasticizers must make up 79.5% of the 
mixture. There are further constraints on each of the plasticizers: 

32.5% <= P1 <= 67.5% 

0% <= P2 <= 20.0%

12.0% <= P3 <= 21.8% 


Because we are interested in only the plasticizers, we can model them separately from
the other components in the overall process. Taking this approach, we can 
reparameterize the components by dividing by 79.5%, giving 

0.409 <= A <= 0.849 

0 <= B <= 0.252

0.151 <= C <= 0.274 
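

The reparameterized bounds can be checked with a few lines of Python. This is only an arithmetic illustration; the dictionary and function names are hypothetical.

```python
# Original bounds, as percentages of the whole formulation.
BOUNDS_PERCENT = {
    "A": (32.5, 67.5),   # plasticizer P1
    "B": (0.0, 20.0),    # plasticizer P2
    "C": (12.0, 21.8),   # plasticizer P3
}
PLASTICIZER_TOTAL = 79.5  # the three plasticizers make up 79.5% of the mixture


def reparameterize(bounds, total):
    """Express each bound as a proportion of the plasticizer total."""
    return {
        name: (round(low / total, 3), round(high / total, 3))
        for name, (low, high) in bounds.items()
    }


if __name__ == "__main__":
    for name, (low, high) in reparameterize(BOUNDS_PERCENT, PLASTICIZER_TOTAL).items():
        print(f"{low:.3f} <= {name} <= {high:.3f}")
```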


We want to be sure that the design points span the feasible region adequately. To 
generate the design using the DOE Wizard, the responses to the prompts follow: 


Wizard Prompt                                                  Response
Design type                                                    Mixture Model
Are there to be constraints for any component(s)?              Yes
The possible kind of constrained design are:                   Extreme vertices plus centroids
Enter the number of mixture components:                        3
Enter the maximum dimension to be used to compute centroids:   1
How many such constraints do you wish to have?                 5
Constraint 1: Enter the coefficient for factor 1:              1
Constraint 1: Enter the coefficient for factor 2:              0
Constraint 1: Enter the coefficient for factor 3:              0
Constraint 1: Enter an additive constant:                      -.409
Constraint 2: Enter the coefficient for factor 1:              -1
Constraint 2: Enter the coefficient for factor 2:              0
Constraint 2: Enter the coefficient for factor 3:              0
Constraint 2: Enter an additive constant:                      .849
Constraint 3: Enter the coefficient for factor 1:              0
Constraint 3: Enter the coefficient for factor 2:              -1
Constraint 3: Enter the coefficient for factor 3:              0
Constraint 3: Enter an additive constant:                      .252
Constraint 4: Enter the coefficient for factor 1:              0
Constraint 4: Enter the coefficient for factor 2:              0
Constraint 4: Enter the coefficient for factor 3:              1
Constraint 4: Enter an additive constant:                      -.151
Constraint 5: Enter the coefficient for factor 1:              0
Constraint 5: Enter the coefficient for factor 2:              0
Constraint 5: Enter the coefficient for factor 3:              -1
Constraint 5: Enter an additive constant:                      .274
Specify the tolerance for checking constraints
and duplication of points:                                     .00001
Display the factors for this design?                           Yes
Save the design to a file?                                     No


The constrained mixture design output is: 


The following are index numbers of input constraints found to be redundant: 


1 
2 


Extreme Vertices % Centroids Design: 3 Factors, 9 Runs, 4 Vertices 


      |           Factor
  Run |      A       B       C
------+------------------------
    1 |  0.849   0.000   0.151
    2 |  0.597   0.252   0.151
    3 |  0.726   0.000   0.274
    4 |  0.474   0.252   0.274
    5 |  0.787   0.000   0.213
    6 |  0.535   0.252   0.213
    7 |  0.723   0.126   0.151
    8 |  0.600   0.126   0.274
    9 |  0.661   0.126   0.213


The design contains nine runs: four points at the extreme vertices of the feasible region, 
four points at the edge centroids, and one point at the overall centroid. The following 
plot displays the constrained region for the mixture as a blue parallelogram with the 
actual design points represented as red filled circles. 
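

The nine design points can be reproduced approximately with the Python sketch below. It is a simplified illustration, not the general extreme-vertices algorithm: because the bounds on A turn out to be redundant here, the vertices are simply the four combinations of the B and C bounds, with A obtained from the restriction A + B + C = 1. The constant and function names are hypothetical.

```python
from itertools import product

A_BOUNDS = (0.409, 0.849)
B_BOUNDS = (0.0, 0.252)
C_BOUNDS = (0.151, 0.274)


def extreme_vertices():
    """Combine the B and C bounds; keep points whose A value is feasible."""
    vertices = []
    for b, c in product(B_BOUNDS, C_BOUNDS):
        a = 1.0 - b - c
        if A_BOUNDS[0] - 1e-9 <= a <= A_BOUNDS[1] + 1e-9:
            vertices.append((a, b, c))
    return vertices


def centroid(points):
    """Average a list of (A, B, C) points component by component."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def edge_centroids(vertices):
    """Midpoints of vertex pairs that share a B bound or a C bound."""
    midpoints = []
    for i in range(len(vertices)):
        for j in range(i + 1, len(vertices)):
            p, q = vertices[i], vertices[j]
            if p[1] == q[1] or p[2] == q[2]:
                midpoints.append(centroid([p, q]))
    return midpoints


def constrained_design():
    vertices = extreme_vertices()
    return vertices + edge_centroids(vertices) + [centroid(vertices)]


if __name__ == "__main__":
    for point in constrained_design():
        print("  ".join(f"{x:.3f}" for x in point))
```

The printed values agree with the table to within rounding in the third decimal place; the order of the runs differs.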


Example 10
Central Composite Response Surface Design


In an industrial experiment reported by Aia et al. (1961), the authors investigated the
response surface of a chemical process for producing dihydrated calcium hydrogen
orthophosphate (CaHPO4·2H2O). The factors of interest are the ratio of NH3 to CaCl2
in the calcium chloride solution, the addition time of the NH3-CaCl2 mixture, and the
beginning pH of the NH4H2PO4 solution used. We will now see how this experiment
would be designed using the DOE Wizard.

For efficiency and rotatability, we use a central composite design with three factors.
The central composite design consists of a 2^k factorial (or fraction thereof), a set of 2k
axial (or “star”) points on the axes of the design space, and some number of center
points.

The choice of number of center points hinges on the desired properties of the design. 
Orthogonal designs (designs in which the factors are uncorrelated) minimize the [...]
point one unit distant from the center. This property of equal variance between the
center of the design and points one unit from the center is called uniform precision. In 
this example, we sacrifice orthogonality in favor of uniform precision. Therefore, we 
use six center points instead of the nine points required to make the design nearly 
orthogonal. (A table of orthogonal and uniform precision designs with appropriate 
numbers of center points can be found in Montgomery, 2000.) 


The input to generate the central composite design is: 


Wizard Prompt                                                          Response
Design type                                                            Central Composite
Enter the number of factors desired:                                   3
Are the cube and star portions of the design to be separate blocks?    No
Enter number of center points desired:                                 6
Display the factors for this design?                                   Yes
Save the design to a file?                                             No


The output is:


Second-order Composite Design: 3 Factors 20 Runs 
      |            Factor
  Run |       A        B        C
------+---------------------------
    1 |  -1.000   -1.000   -1.000
    2 |  -1.000   -1.000    1.000
    3 |  -1.000    1.000   -1.000
    4 |  -1.000    1.000    1.000
    5 |   1.000   -1.000   -1.000
    6 |   1.000   -1.000    1.000
    7 |   1.000    1.000   -1.000
    8 |   1.000    1.000    1.000
    9 |  -1.682    0.000    0.000
   10 |   1.682    0.000    0.000
   11 |   0.000   -1.682    0.000
   12 |   0.000    1.682    0.000
   13 |   0.000    0.000   -1.682
   14 |   0.000    0.000    1.682
   15 |   0.000    0.000    0.000
   16 |   0.000    0.000    0.000
   17 |   0.000    0.000    0.000
   18 |   0.000    0.000    0.000
   19 |   0.000    0.000    0.000
   20 |   0.000    0.000    0.000


In the central composite design, each factor is measured at five different levels. The
runs with no zeros for the factors are the factorial (“cube”) points, the runs with only
one nonzero factor are the axial (“star”) points, and the runs with all zeros are the center
points.
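

The three kinds of points can be generated with a short Python sketch. It assumes the rotatable axial distance, the fourth root of the number of cube points (8 for three factors, giving 1.682), which matches the axial values in the output above. The function name is hypothetical and this is not SYSTAT code.

```python
from itertools import product


def central_composite(n_factors=3, n_center=6):
    """Return cube, star, and center points of a rotatable design."""
    # Rotatable axial distance: fourth root of the number of cube points.
    alpha = (2 ** n_factors) ** 0.25

    # Factorial ("cube") points: every combination of -1 and +1.
    cube = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=n_factors)]

    # Axial ("star") points: one factor at -alpha or +alpha, the rest at 0.
    star = []
    for i in range(n_factors):
        for sign in (-1, 1):
            point = [0.0] * n_factors
            point[i] = sign * alpha
            star.append(tuple(point))

    # Replicated center points.
    center = [(0.0,) * n_factors] * n_center
    return cube + star + center


if __name__ == "__main__":
    for run, point in enumerate(central_composite(), start=1):
        print(f"{run:3d} |", "  ".join(f"{x:7.3f}" for x in point))
```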

After collecting data according to this design, fit a response surface to analyze the
results. You could also find the optimal levels of factors A,B,C by using the feature 
Response Surface Methods. 


Example 11 
Optimal Designs: Coordinate Exchange 


Consider a situation in which you want to compute a response surface but your 
resources are very limited. Assume that you have three continuous factors but can 
afford only 12 runs. This number of runs is not enough for any of the standard response 
surface models. However, you can generate a design with 12 runs that will allow you 
to estimate the effects of interest using an optimal design. 


To generate the design using the DOE Wizard, the responses to the prompts are:


Wizard Prompt                                                     Response
Design type                                                       Optimal
Choose the method to use:                                         Coordinate Exchange
Choose the type of optimality desired:                            D-optimality
Specify the number of points to replace in a single iteration:    1
Specify the maximum number of iterations within a trial:          100
Specify the relative convergence tolerance:                       .00001
Specify the number of trials to be run:                           3
Random number seed:                                               131
The starting design is to be:                                     Generated by the program.
Enter the number of factors desired:                              3
How many points (runs) are desired?                               12
The variables in the design are:                                  All continuous
Limits for factor A                                               lower limit = -1
                                                                  upper limit = 1
Limits for factor B                                               lower limit = -1
                                                                  upper limit = 1
Limits for factor C                                               lower limit = -1
                                                                  upper limit = 1
Does the model for your designed design contain an additive
constant?                                                         Yes
Define other effects to be included in the model:                 A*A
                                                                  B*B
                                                                  C*C
                                                                  A*B
                                                                  A*C
                                                                  B*C
                                                                  A*B*C
Display the factors from trial 1                                  Yes
Save the design to a file?                                        No
Display the factors from trial 2                                  Yes
Save the design to a file?                                        No
Display the factors from trial 3                                  Yes
Save the design to a file?                                        No


The design that was output on the third trial is: 


Design from Coordinate-exchange Algorithm: 12 Runs, 3 Factors k=1 


  RUN |            Factor
      |       A        B        C
------+---------------------------
    1 |  -1.000    1.000   -0.046
    2 |  -0.039   -0.000   -1.000
    3 |  -1.000    1.000   -1.000
    4 |   1.000    1.000   -1.000
    5 |   1.000    1.000    1.000
    6 |  -1.000   -1.000   -1.000
    7 |   1.000   -1.000   -1.000
    8 |  -1.000    1.000    1.000
    9 |   1.000   -0.046    0.001
   10 |  -1.000   -1.000    1.000
   11 |   0.081   -1.000   -0.037
   12 |   1.000   -1.000    1.000


The points shown here were generated from a particular run of the algorithm. Since the
initial design depends on a randomly chosen starting point, your design may vary
slightly from the design shown here. However, your design should share several
characteristics with this one. First, notice that most values appear to be very close to
one of three values: -1, 0, or +1. For the purposes of conceptual discussion, we can act
as if the values were rounded to the nearest integer. We can see that the design includes
the eight corners of the design space (the runs where all values are either -1 or +1). The
design also includes three points that are face centers (runs where two values are near
0), and one edge point (where only one value is near 0).

This design will allow you to estimate all first- and second-order effects in your model.
Of course, you will not have as much precision as you would if you had used a Box- 
Behnken or central composite design, because you do not have as much information to 
work with. You also lose some of the other advantages of the standard designs, such as 
rotatability. However, because the design is optimized with respect to generalized 
variance of parameter estimates, you will be getting as much information as you can 
out of your 12 runs. 
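

To make the D-optimality criterion concrete, the Python sketch below computes the determinant of the information matrix X'X for the design above after rounding its values to -1, 0, and +1, and compares it with a 12-run alternative made of the eight corners plus four center points. The model terms are taken to be the constant, the three main effects, and the seven effects listed in the wizard input; including the main effects is an assumption. Exact fractions are used so that a singular design gives a determinant of exactly zero. This illustrates the criterion only and is not the coordinate-exchange algorithm; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def model_row(a, b, c):
    """Model terms: constant, A, B, C, A*A, B*B, C*C, A*B, A*C, B*C, A*B*C."""
    return [1, a, b, c, a * a, b * b, c * c, a * b, a * c, b * c, a * b * c]


def information_matrix(design):
    """Return X'X for the model matrix X built from the design."""
    x = [model_row(*run) for run in design]
    p = len(x[0])
    return [[sum(Fraction(r[i]) * r[j] for r in x) for j in range(p)] for i in range(p)]


def determinant(matrix):
    """Exact determinant by Gaussian elimination with row swaps."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


corners = list(product((-1, 1), repeat=3))
# Rounded version of the design above: corners, three face centers, one edge point.
rounded_design = corners + [(0, 0, -1), (1, 0, 0), (0, -1, 0), (-1, 1, 0)]
# Alternative with the same number of runs: corners plus four center points.
center_padded_design = corners + [(0, 0, 0)] * 4

if __name__ == "__main__":
    print("rounded design:", determinant(information_matrix(rounded_design)))
    print("corners + centers:", determinant(information_matrix(center_padded_design)))
```

The second design has a determinant of zero because center points alone cannot separate the three squared terms, so the model cannot be estimated from it. The design found by the algorithm spreads its four non-corner runs so that every term is estimable.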


Discriminant Analysis


Laszlo Engelman 
(revised by Sayyad Nisar Badashah and Rajesh V. Nath) 


DISCRIM provides classical and robust discriminant analysis. With both methods,
linear and quadratic discriminant analysis can be performed. In the classical linear 
discriminant analysis, the variables can be selected in a forward or backward stepwise 
manner, either interactively by the user or automatically by SYSTAT. For the latter, at 
each step, SYSTAT enters the variable that contributes most to the separation of the 
groups (or removes the variable that is the least useful). 

The command language allows you to emphasize the difference between specific 
groups; contrasts can be used to guide variable selection. Cases can be classified even 
if they are not used in the computations. 

Discriminant analysis is related to both multivariate analysis of variance and 
multiple regression. The cases are grouped in cells like a one-way multivariate 
analysis of variance and the predictor variables form an equation like that for multiple 
regression. In discriminant analysis, Wilks's lambda, the same test used in 
multivariate ANOVA, is used not just to test multivariate differences among groups, 
but also to explore:


■ which variables are most useful for discriminating among groups;
■ if one subset of variables performs as well as another;
■ which groups are most alike and most different.


When the data sets are suspected to contain outliers, you can request linear or
quadratic robust discriminant analysis. With robust discriminant analysis, you can
save the robust distances, Mahalanobis distance, weights, and predicted group
membership.

Resampling procedures are available in the DISCRIM module but not in the
RDISCRIM module.


Statistical Background


When we have categorical variables in a model, it is often because we are trying to 
classify cases; that is, what group does someone or something belong to? For example, 
we might want to know whether someone with a grade point average (GPA) of 3.5 and 
an Advanced Psychology Test score of 600 is more like the group of graduate students 
successfully completing a Ph.D. or more like the group that fails. Or, we might want to 
know whether an object with a plastic handle and no concave surfaces is more like a 
wrench or a screwdriver. 

Once we attempt to classify, our attention turns from parameters (coefficients) in a 
model to the consequences of classification. We now want to know what proportion of 
subjects will be classified correctly and what proportion incorrectly. Discriminant 
analysis is one method for answering these questions. 


Linear Discriminant Model 


If we know that our classifying variables are normally distributed within groups, we 
can use a classification procedure called linear discriminant analysis (Fisher, 1936). 
Before we present the method, however, we should warn you that the procedure 
requires you to know that the groups share a common covariance matrix and you must 
know what the covariance matrix values are. We have not found an example of 
discriminant analysis in the social sciences where this was true. The most appropriate
applications we have found are in engineering, where a covariance matrix can be
deduced from physical measurements. Discriminant analysis is used, for example, in
automated vision systems for detecting objects on moving conveyor belts.

Why do we need to know the covariance matrix? We are going to use it to calculate 
Mahalanobis distances (developed by the Indian statistician Prasanta Chandra
Mahalanobis). These distances are calculated between cases we want to classify and 
the center of each group in a multidimensional space. The closer a case is to the center 
of one group (relative to its distance to other groups), the more likely it is to be 
classified as belonging to that group. The figure below shows what we are doing. 

The borders of this graph comprise the two predictors GPA and GRE. The two
“hills” are centered at the mean values of the two groups (No Ph.D. and Ph.D.). Most
of the data in each group are supposed to be under the highest part of each hill. The
hills, in other words, mathematically represent the concentration of data values in the
scatterplot beneath.

The shape of the hills was computed from a bivariate normal distribution using the
covariance matrix averaged within groups. We have plotted this figure this way to 
show you that this model is like a pie-in-the-sky if you use the information in the data 
below to compute the shape of these hills. As you can see, there is a lot of smoothing 
of the data going on, and if one or two data values in the scatterplot unduly influence 
the shape of the hills above, you will have an unrepresentative model when you try to 
use it on new samples. 

How do we classify a new case into one group or another? Look at the figure again.
The “new case” could belong to one or the other group. It is more likely to belong to
the closer group, however. The simple way to find how far this case is from the center
of each group would be to take a direct walk from the new case to the center of each
group in the data plot.


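[Figure: bivariate normal models (“hills”) for the two groups, drawn above the scatterplot of GPA and GRE, with the new case marked.]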

Instead of walking in sample data space below, however, we must climb the hills of our
theoretical model above when using the normal classification model. In other words,
we will use our theoretical model to calculate distances. The covariance matrix we
used to draw the hills in the figure makes distances depend on the direction we are
heading. The distance to a group is thus proportional to the altitude (not the horizontal
distance) we must climb to get to the top of the corresponding hill.

Because these hills can be oblong in shape, it is possible to be quite far from the top
of the hill as the crow flies, yet have little altitude to cover in a climb. Conversely, it is
possible to be close to the center of the hill and have a steep climb to get to the top.
Discriminant analysis adjusts for the covariance that causes these eccentricities in hill 
shape. That is why we need the covariance matrix in the first place. 
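
The distance calculation can be sketched in a few lines of Python. The group means and the pooled within-groups covariance matrix are the rounded values printed in the ADMIT output later in this section, with variables ordered (GRE, GPA); the function names are hypothetical, and this is an illustration rather than SYSTAT's own computation.

```python
# Rounded values from the ADMIT output; order of variables is (GRE, GPA).
GROUP_MEANS = {0: (590.490, 4.423), 1: (643.448, 4.639)}
POOLED_COV = ((4512.409, 1.543), (1.543, 0.095))


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)' S^-1 (x - mean) for two variables."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    d0 = x[0] - mean[0]
    d1 = x[1] - mean[1]
    # The inverse of a 2 x 2 matrix is [[d, -b], [-c, a]] / det.
    return (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det


def nearest_group(x):
    """Group whose center is closest to x in Mahalanobis distance."""
    return min(GROUP_MEANS, key=lambda g: mahalanobis_sq(x, GROUP_MEANS[g], POOLED_COV))


if __name__ == "__main__":
    case = (600.0, 4.5)  # hypothetical new case: GRE 600, GPA 4.5
    for group, mean in GROUP_MEANS.items():
        print(group, round(mahalanobis_sq(case, mean, POOLED_COV), 3))
    print("closest group:", nearest_group(case))
```

Because the covariance matrix enters through its inverse, a difference of a few tenths of a point on GPA counts for about as much as a difference of dozens of points on GRE, which is the adjustment for hill shape described above.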

So much for the geometric representation. What do the numbers look like? (See Hill
and Engelman, 1992.) Let us look at how to set up the problem with SYSTAT.


The input is: 


DISCRIM
USE ADMIT
PLENGTH LONG
MODEL PHD = GRE,GPA
ESTIMATE


The output is:


Group Frequencies
      0     1
     51    29


Group Means

    |       0         1
----+--------------------
GRE | 590.490   643.448
GPA |   4.423     4.639


Pooled Within Covariance Matrix df : 78

    |      GRE     GPA
----+------------------
GRE | 4512.409
GPA |    1.543   0.095


Within Correlation Matrix 


Total Covariance Matrix df : 79

    |      GRE     GPA
----+------------------
GRE | 5111.610
GPA |    4.201   0.104


Total Correlation Matrix

    |   GRE     GPA
----+----------------
GRE | 1.000
GPA | 0.182   1.000


Between Groups F-matrix df : 2 77

0.000
9.469 0.000


Wilks's Lambda

Lambda          : 0.803
df              : (2,1,78)
Approx. F-ratio : 9.469
df              : (2,77)
p-value         : 0.000


Classification Functions

         |        0          1
---------+----------------------
CONSTANT | -133.910   -150.231
GRE      |    0.116      0.127
GPA      |   44.818     46.920


Variable   F-to-remove   Tolerance | Variable   F-to-enter   Tolerance
-----------------------------------+-----------------------------------
5 GRE            8.901       0.994 |
2 GPA            6.620       0.994 |


Classification Matrix (Cases in row categories classified into columns)

      |   0    1   %correct
------+---------------------
    0 |  38   13         75
    1 |   7   22         76
Total |  45   35         75


Jackknifed Classification Matrix

      |   0    1   %correct
------+---------------------
    0 |  37   14         73
    1 |   7   22         76
Total |  44   36         74


Eigenvalues
0.246


Cumulative Proportion of Total Dispersion


Test Statistic

Statistic                 Approx. F-ratio   df       P-value
Wilks's Lambda                      9.469    2  77     0.000
Pillai's Trace                      9.469    2  77     0.000
Lawley-Hotelling Trace              9.469    2  77     0.000


Constant | 


GRE 
GPA 


Canonical Discriminant Functions : Standardized by Within Variances 


0 
1 


There is a lot to follow on this output. The counts and means per group are shown first.
Next comes the Pooled within covariance matrix, computed by averaging the separate- [...]
The Total covariance matrix ignores the groups. It includes variation due to the group
separation. These are the same matrices found in the MANOVA output with PLENGTH
LONG. The Between [...] We compute the predicted
value of each equation for a case's values on GPA and GRE and classify the case into
the group whose function yields the largest value.
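
As a sketch of this rule, the Python fragment below evaluates the two classification functions from the output for a given case and picks the larger. The coefficients are the rounded values printed above, so results for borderline cases may differ slightly from SYSTAT's; the example cases and function names are hypothetical.

```python
# Classification function coefficients copied from the output (rounded).
CLASSIFICATION_FUNCTIONS = {
    0: {"CONSTANT": -133.910, "GRE": 0.116, "GPA": 44.818},
    1: {"CONSTANT": -150.231, "GRE": 0.127, "GPA": 46.920},
}


def function_values(gre, gpa):
    """Evaluate each group's classification function for one case."""
    return {
        group: coef["CONSTANT"] + coef["GRE"] * gre + coef["GPA"] * gpa
        for group, coef in CLASSIFICATION_FUNCTIONS.items()
    }


def classify(gre, gpa):
    """Assign the case to the group with the largest function value."""
    values = function_values(gre, gpa)
    return max(values, key=values.get)


if __name__ == "__main__":
    for gre, gpa in [(600, 4.5), (700, 4.8)]:  # hypothetical cases
        print(gre, gpa, function_values(gre, gpa), "->", classify(gre, gpa))
```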


Next come the Separate F-ratio for each variable and the Classification matrix. The
goodness of classification is comparable to that for the PROBIT model. We did a little [...]
sample, leaving out single cases to classify the remainder. There is no substitute for
trying the model on new data.
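
The percentages in the two classification matrices follow directly from the counts. The short Python sketch below recomputes them; the count of 7 Ph.D. cases classified as No Ph.D. is inferred from the column totals in the output, and the variable and function names are hypothetical.

```python
# Rows are actual groups, columns are predicted groups.
CLASSIFICATION = {0: {0: 38, 1: 13}, 1: {0: 7, 1: 22}}
JACKKNIFED = {0: {0: 37, 1: 14}, 1: {0: 7, 1: 22}}


def percent_correct(matrix):
    """Return per-group and overall percentages of correct classification."""
    per_group = {
        group: 100.0 * row[group] / sum(row.values())
        for group, row in matrix.items()
    }
    n_correct = sum(row[group] for group, row in matrix.items())
    n_total = sum(sum(row.values()) for row in matrix.values())
    return per_group, 100.0 * n_correct / n_total


if __name__ == "__main__":
    for name, matrix in [("classification", CLASSIFICATION), ("jackknifed", JACKKNIFED)]:
        per_group, overall = percent_correct(matrix)
        print(name, {g: round(p) for g, p in per_group.items()}, round(overall))
```

The jackknifed figures are slightly lower because each case is classified by functions computed without that case.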

Finally, the program prints the same information produced by SYSTAT’s 
MANOVA. The multivariate test statistics show the groups to be significantly different 
on GPA and GRE taken together. 
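
The test statistics in the output are tied together by simple identities, which the Python sketch below checks. Wilks's lambda is the product of 1/(1 + eigenvalue) over the eigenvalues, and for two groups the F-ratio is ((1 - lambda)/lambda) times (N - p - 1)/p, with N cases and p predictors. These are standard formulas; the function names are hypothetical, and small discrepancies arise because the printed eigenvalue is rounded.

```python
def wilks_lambda(eigenvalues):
    """Wilks's lambda as the product of 1 / (1 + eigenvalue)."""
    value = 1.0
    for eigenvalue in eigenvalues:
        value /= 1.0 + eigenvalue
    return value


def two_group_f(lam, n_cases, n_vars):
    """F-ratio for two groups, with n_vars and n_cases - n_vars - 1 df."""
    return (1.0 - lam) / lam * (n_cases - n_vars - 1) / n_vars


if __name__ == "__main__":
    lam = wilks_lambda([0.246])        # eigenvalue from the output
    print(round(lam, 3))               # compare with Lambda in the output
    print(round(two_group_f(lam, 80, 2), 3))  # compare with Approx. F-ratio
```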


Linear Discriminant Function 


We mentioned in the last section that the canonical coefficients are like a regression 
equation for computing distances up the hills. Let us look more closely at these 
coefficients. The following figure shows the plot underlying the surface in the last 
figure. Superimposed at the top of the GRE axis are two normal distributions centered 
at the means for the two groups. The standard deviations of these normal distributions 
are computed within groups. The within-group standard deviation is the square root of 
the diagonal GRE variance element of the residual covariance matrix (4512.409). The 
same is done for GPA on the right, using the square root of the within-groups variance 
(0.095) for the standard deviation and the group means for centering the normals. 


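[Figure: scatterplot of GPA against GRE (GRE axis from 400 to 800), with within-group normal curves along each axis and along the diagonal canonical variable.]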


Either of these variables separates the groups somewhat. The diagonal line underlying 
the two diagonal normal distributions represents a linear combination of these two 
variables. It is computed using the canonical discriminant functions in the output. 
These are the same as the canonical coefficients produced by GLM. Before applying 
these coefficients, the variables must be standardized by the within-group standard
deviations. Finally, the dashed line perpendicular to this diagonal cuts the observations 
into two groups: those to the left and those to the right of the dashed line. 

You can see that this new canonical variable and its perpendicular dashed line are [...]
dashed line and N’s to the right are misclassifications. Try rotating these axes any other
way to get a better count of correctly classified cases (watch out for ties). The linear [...]

Using this linear discriminant function variable, we get the same classifications we 
got with the Mahalanobis distance method. Before computers, this was the preferred 
method for classifying because the computations are simpler. 


We just use the equation:


F_z = 0.635 * Z_GPA + 0.727 * Z_GRE


The two Z variables are the raw scores minus the overall mean divided by the
within-groups standard deviations. If F_z is less than 0, classify No Ph.D.; otherwise, classify
Ph.D.
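
A minimal Python sketch of this rule follows. The overall means are not printed in the output, so they are reconstructed here from the group means and group sizes; the within-groups standard deviations are the square roots of the pooled variances (4512.409 and 0.095). Group 0 is taken to be No Ph.D. because it is the larger group. All names are hypothetical.

```python
import math

GROUP_SIZES = {0: 51, 1: 29}
GROUP_MEANS = {
    "GRE": {0: 590.490, 1: 643.448},
    "GPA": {0: 4.423, 1: 4.639},
}
WITHIN_VARIANCE = {"GRE": 4512.409, "GPA": 0.095}
COEFFICIENTS = {"GPA": 0.635, "GRE": 0.727}


def overall_mean(variable):
    """Weighted average of the group means (weights are group sizes)."""
    total = sum(GROUP_SIZES.values())
    return sum(GROUP_SIZES[g] * GROUP_MEANS[variable][g] for g in GROUP_SIZES) / total


def discriminant_score(gre, gpa):
    """Standardize each variable, then apply the canonical coefficients."""
    values = {"GRE": gre, "GPA": gpa}
    return sum(
        COEFFICIENTS[v] * (values[v] - overall_mean(v)) / math.sqrt(WITHIN_VARIANCE[v])
        for v in COEFFICIENTS
    )


def classify(gre, gpa):
    return "No Ph.D." if discriminant_score(gre, gpa) < 0 else "Ph.D."


if __name__ == "__main__":
    print(classify(590.490, 4.423))  # a case at the group 0 means
    print(classify(643.448, 4.639))  # a case at the group 1 means
```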


Prior Probabilities 


Our sample contained fewer Ph.D.s than No Ph.D.s. If we want to use our discriminant 
model to classify new cases and if we believe that this difference in sample sizes 
reflects proportions in the population, then we can adjust our formula to favor No 
Ph.D.s. In other words, we can make the prior probabilities [...]
either by selecting the Priors option available in the dialog box and specifying the
values of the prior probabilities or by adding


PRIORS = 0.625, 0.375 


to the MODEL command. In fact you can specify any set of numbers π_1, π_2, ..., π_K (K
being the number of groups) to indicate these prior probabilities (proportional to π_1, π_2,
..., π_K). If you believe that the sample proportions indicate the prior probabilities
satisfactorily, then you can let the system work it out. Do not be tempted to use this 
method as a way of improving your classification table. If the probabilities you choose 
do not reflect real population differences, then new samples will, on the average, be 
classified worse. It would make sense in our case because we happen to know that 
more people in our department tend to drop out than stay for the Ph.D. 
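
The effect of priors can be illustrated with the usual formula for a normal model with a common covariance matrix: the posterior probability of group k is proportional to its prior times exp(-D_k^2 / 2), where D_k^2 is the squared Mahalanobis distance from the case to the center of group k. The Python sketch below applies this formula to made-up distances; it is a sketch of the principle under that assumption, not a statement of how SYSTAT carries out the computation, and the names are hypothetical.

```python
import math


def posterior_probabilities(squared_distances, priors):
    """Posterior probability of each group given distances and priors."""
    weights = {
        group: priors[group] * math.exp(-0.5 * d2)
        for group, d2 in squared_distances.items()
    }
    total = sum(weights.values())
    return {group: weight / total for group, weight in weights.items()}


if __name__ == "__main__":
    distances = {0: 2.0, 1: 1.0}  # hypothetical case, somewhat closer to group 1
    print(posterior_probabilities(distances, {0: 0.5, 1: 0.5}))
    print(posterior_probabilities(distances, {0: 0.625, 1: 0.375}))
```

With equal priors the hypothetical case is assigned to group 1, the closer group. With priors of 0.625 and 0.375 the balance tips narrowly to group 0, which shows why priors should reflect real population proportions rather than be tuned to improve the classification table.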


Multiple Groups 


The discriminant model generalizes to more than two groups. Imagine, for example, 
three hills in the first figure. All the distances and classifications are computed in the 
same manner. The posterior probabilities for classifying cases are computed by
comparing three distances rather than two. 

The multiple group (canonical) discriminant model yields more than one 
discriminant axis. For three groups, we get two sets of canonical discriminant 
coefficients. For four groups, we get three. If we have fewer variables than groups, then 
we get only as many sets as there are variables. The group classification function 
coefficients are handy for classifying new cases with the multiple group model. Simply 
multiply each coefficient times each variable and add in the constant. Then assign the 
case to the group whose set yields the largest value. 


Robust Discriminant Analysis 


In classical discriminant analysis, the discriminant function is constructed using 
maximum likelihood estimates of group mean vectors and covariance matrices. 
However, these estimates are highly sensitive to outliers and consequently the 
discriminant rule may perform poorly. To overcome this problem, a robust 
discriminant function is obtained by using the (reweighted) minimum covariance 
determinant (MCD) estimates of the mean vector and covariance matrices in place of 
the classical ones (Rousseeuw, 1984, 1985). The MCD estimates are computed using 
the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999). See Hubert and
Van Driessen (2004) for more details about robust discriminant analysis. 


Discriminant Analysis in SYSTAT 


Classical Discriminant Analysis Dialog box 


To open the Classical Discriminant Analysis dialog box, from the menus choose: 


Analyze 
Discriminant Analysis 
Classical... 


[Screenshot: Classical Discriminant Analysis dialog box, showing the Available variable(s) list, the Grouping variable and Predictor(s) boxes, the Quadratic check box, and the Priors options.]

The following options can be specified:

Grouping variable. Select one categorical variable as the grouping variable.

Predictor(s). Select one or more predictor variables.

Quadratic. The Quadratic check box requests quadratic discriminant analysis. If not
selected, linear discriminant analysis is performed.

Priors. SYSTAT provides three options for specifying prior probabilities:

■ All groups equal. This is the default option; it assigns equal prior probabilities to
  the groups.
■ Compute from group sizes. For each group, this computes and assigns prior
  probabilities proportional to the sample size of that group.
■ Specified values. You can specify prior probabilities for the groups.

Save. For each case, Distances saves the Mahalanobis distances to each group centroid
and the posterior probability of the membership in each group. Scores saves the
canonical variable scores. Scores/Data and Distances/Data save scores and distances
along with the data.


Options 


With Classical Discriminant Analysis, several controls for stepwise model building
and tolerance are available. To access these options, click the Options tab in the dialog
box.

[Screenshot: Options tab of the Classical Discriminant Analysis dialog box, showing the Tolerance box, the Estimation options (Complete, Stepwise), and the Stepwise options for Direction, Control, Enter, Remove, and Number.]


The following can be specified: 


Tolerance. The tolerance sets the matrix inversion tolerance limit. Default value is
0.001.

Two estimation options are available:

■ Complete. All variables are used in the model.
■ Stepwise. Variables can be selected in a forward or backward stepwise manner,
  either interactively by the user or automatically by SYSTAT.

If you select stepwise estimation, you can specify the direction in which the estimation
should proceed, whether SYSTAT should control variable entry and elimination, and
any desired criteria for variable entry and elimination.

■ Backward. In backward stepping, all variables are entered, irrespective of their
  F-to-enter values (if a variable fails the Tolerance limit, however, it is excluded).
  F-to-remove and F-to-enter values are reported. When Backward is selected along
  with Automatic, at each step, SYSTAT removes the variable with the lowest
  F-to-remove value that passes the Remove limit of the F statistic (or reenters the
  variable with the largest F-to-enter above the Remove limit of the F statistic).
■ Forward. In forward stepping, the variables are entered in the model. F-to-enter
  values are reported for all candidate variables, and F-to-remove values are reported
  for forced variables. When Forward is selected along with Automatic, at each step,
  SYSTAT enters the variable with the highest F-to-enter that passes the Enter limit
  of the F statistic (or removes the variable with the lowest F-to-remove below the
  Remove limit of the F statistic).
■ Automatic. SYSTAT enters or removes variables automatically. F-to-enter and
  F-to-remove limits are used.
■ Interactive. Variables are interactively removed from and/or added to the model at
  each step. In the Command pane, type a STEP command to enter and remove
  variables interactively.


STEP                 One variable is entered into or removed from the model (based
                     on the Enter and Remove limits of the F statistic).
STEP +               Variable with the largest F-to-enter is entered into the model
                     (irrespective of the Enter limit of the F statistic).
STEP -               Variable with the smallest F-to-remove is removed from the
                     model (irrespective of the Remove limit of the F statistic).
STEP c,e             Variables named c and e are stepped into/out of the model
                     (irrespective of the Enter and Remove limits of the F statistic).
STEP 3,5             Third and fifth variables are stepped into/out of the model
                     (irrespective of the Enter and Remove limits of the F statistic).
STEP / NUMBER = 3    Three variables are entered into or removed from the model.
STOP                 Stops the stepping and generates final output (classification
                     matrices, eigenvalues, canonical variables, etc.).


Variables are added to or eliminated from the model based on one of two possible
criteria.

■ Probability. Variables with probability (F-to-enter) smaller than the Enter
  probability are entered into the model if Tolerance permits. The default Enter value
  is 0.15. For highly correlated predictors, you may want to set Enter = 0.01.
  Variables with probability (F-to-remove) larger than the Remove probability are
  removed from the model. The default Remove value is 0.15.
■ F-statistic. Variables with F-to-enter values larger than the Enter F value are
  entered into the model if Tolerance permits. The default Enter value is 4. Variables
  with F-to-remove values smaller than the Remove F value are removed from the
  model. The default Remove value is 3.9.


You can also specify variables to include in the model, regardless of whether they meet
the criteria for entry into the model. In the Force text box, enter the number of [...]
(for example, Force = 2 means include the first two variables on the Variables list in the
[...]


Statistics 


When you perform the classical discriminant analysis, you can select any desired
output elements by clicking Statistics in the dialog box.


[Screenshot: Statistics tab of the Classical Discriminant Analysis dialog box, with check boxes grouped under Short statistics, Medium statistics, and Long statistics.]

All selected statistics will be displayed in the output. Depending on the specified length
of your output, you may also see additional statistics. By default, the print length is
set to Short (you will see all of the statistics on the Short statistics list). To change the
length of your output, choose Options from the Edit menu. Select Short, Medium, or
Long from the Length drop-down list. Again, all selected statistics will be displayed in
the output, regardless of the print setting.


Short statistics. Options for Short statistics are FMatrix (between-groups F matrix), 
FStats (F-to-enter/remove statistics), Eigen (eigenvalues and canonical correlations), 
CMeans (canonical scores of group means), and Sum (summary panel for stepwise 
discriminant analysis). 


Medium statistics. Options for Medium statistics are those for Short statistics plus 
Means (group frequencies and means), Wilks (Wilks’s lambda and approximate F), 
CFunc (discriminant functions), Traces (Lawley-Hotelling, Pillai and Wilks’s traces), 
CDFunc (canonical discriminant functions), SCDFunc (standardized canonical 
discriminant functions), Class (classification matrix), and JClass (Jackknifed 
classification matrix). 

Long statistics. Options for Long statistics are those for Medium statistics plus WCov 
(within covariance matrix), WCorr (within correlation matrix), TCov (total covariance 
matrix), TCorr (total correlation matrix), GCov (groupwise covariance matrix), and 
GCorr (groupwise correlation matrix). 


Mahalanobis distances, posterior probabilities (Mahal), and canonical scores (CScore) 
for each case must be specified individually. 


Robust Discriminant Analysis Dialog Box 


To obtain the Robust Discriminant Analysis dialog box, from the menus choose 


Analyze 
Discriminant Analysis 
Robust... 




[Dialog box: Analyze: Discriminant Analysis: Robust, with an Available variable(s) list, a required Grouping variable box, a Predictor(s) list, and a Quadratic check box]


Grouping variable. Select one categorical variable as the grouping variable. 


Predictor(s). Select one or more predictor variables. 


Quadratic. The Quadratic check box requests a robust quadratic discriminant analysis. 
If not selected, robust linear discriminant analysis is performed. 


Save. The following alternatives are available:

■ Distances. Saves predicted group membership, classical Mahalanobis distances, robust distances, misclassification variable and weights.

■ Distances/Data. Saves the contents given by the Distances statement plus all the variables in the model, including any transformed values.
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Using Commands 


For Classical Discriminant Analysis 


Select your data by typing USE filename and continue as follows: 


Basic

DISCRIM
MODEL grpvar = varlist / QUADRATIC PRIORS=n1,n2,...
CONTRAST [matrix]
PLENGTH / length element
SAVE / DATA SCORES DISTANCES
ESTIMATE / TOL=n SAMPLE= BOOT(m,n) or JACK or SIMPLE(m,n)

Stepwise (Instead of ESTIMATE, specify START)

START / FORWARD TOL=n ENTER=p REMOVE=p FENTER=n FREMOVE=n FORCE=n BACKWARD
STEP  no argument or / NUMBER=n AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
      or +
      or -
      or varlist
      or nvari, nvarj, ...
(sequence of STEPs)
STOP


In addition to indicating a length for the PLENGTH output, you can select elements not 
included in the output for the specified length. Elements for each length include: 


Length Element 

SHORT FMATRIX FSTATS EIGEN CMEANS SUM CLASS JCLASS 
MEDIUM MEANS WILKS CFUNC TRACES CDFUNC SCDFUNC 
LONG WCOV WCOR TCOV TCOR GCOV GCOR 


MAHAL and CSCORE must be specified individually. No length specification includes 
these statistics. 
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For Robust Discriminant Analysis 


RDISCRIM 
USE FILENAME 
MODEL grpvar = varlist / QUADRATIC
PLENGTH length 
SAVE /DISTANCES, DATA 
ESTIMATE 


Usage Considerations 


Types of data. DISCRIM and RDISCRIM use rectangular data only.

Print options. PLENGTH options allow the user to select panels of output to display, including group means, variances, covariances, and correlations.

Quick Graphs. For two canonical variables, SYSTAT produces a canonical scores plot, in which the axes are the canonical variables and the points are the canonical variable scores.

Saving files. With DISCRIM, you can save the Mahalanobis distances to each group centroid (with the posterior probability of the membership in each group) or the canonical variable scores. With RDISCRIM, you can save robust distances, weights, predicted group membership, misclassification variable, etc., along with the original data.

BY groups. DISCRIM and RDISCRIM analyse data by groups.

Case frequencies. DISCRIM and RDISCRIM use a FREQ variable to increase the number of cases.

Case weights. You can weight each case in a classical discriminant analysis using a weight variable. Use a binary weight variable coded 0 and 1 for cross-validation. Cases that have a zero weight do not influence the estimation of the discriminant functions but are classified into groups. Case weights are not available in RDISCRIM.
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Examples 


Example 1 
Discriminant Analysis Using Complete Estimation 


In this example, we examine measurements made on 150 iris flowers: sepal length, 
sepal width, petal length, and petal width (in centimeters). The data are from Fisher 
(1936), and are grouped by species: Setosa, Versicolor, and Virginica (coded as 1, 2, 
and 3, respectively). 

The goal of the discriminant analysis is to find a linear combination of the four 
measures that best classifies or discriminates among the three species (groups of 
flowers). Here is a SPLOM of the four measures with within-group bivariate confidence 
ellipses and normal curves. 


The input is:

DISCRIM
USE IRIS
SPLOM sepallen..petalwid / HALF GROUP=species ELL,
      DENSITY=NORM OVERLAY


The output is: 


[Figure: scatterplot matrix (SPLOM) of SEPALLEN, SEPALWID, PETALLEN, and PETALWID, grouped by SPECIES, with within-group confidence ellipses and normal curves]

Let us see what a default analysis tells us about the separation of the groups and the 
usefulness of the variables for the classification. 
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The input is: 


USE IRIS
LABEL SPECIES / 1="Setosa", 2="Versicolor", 3="Virginica"
DISCRIM
MODEL SPECIES = SEPALLEN..PETALWID
PLENGTH / MEANS CLASS JCLASS
ESTIMATE


Note the shortcut notation (..) in the MODEL statement for listing consecutive variables 
in the file (otherwise, simply list each variable name separated by a space). 


The output is: 
Group Frequencies

            |    Setosa  Versicolor   Virginica
SEPALLEN    |    5.0060      5.9360      6.5880
SEPALWID    |    3.4280      2.7700      2.9740
PETALLEN    |    1.4620      4.2600      5.5520
PETALWID    |    0.2460      1.3260      2.0260

Between Groups F-matrix
df : 4 144

            |    Setosa  Versicolor   Virginica
Setosa      |    0.0000
Versicolor  |  550.1889      0.0000
Virginica   | 1098.2738    105.3127      0.0000

Variable      F-to-remove  Tolerance | Variable  F-to-enter  Tolerance
 2 SEPALLEN
 3 SEPALWID
 4 PETALLEN
 5 PETALWID

Classification Matrix (Cases in row categories classified into columns)

            |  Setosa  Versicolor  Virginica  %correct
Setosa      |      50           0          0       100
Versicolor  |       0          48          2        96
Virginica   |       0           1         49        98
Total       |      50          49         51        98

Jackknifed Classification Matrix

            |  Setosa  Versicolor  Virginica  %correct
Setosa      |
Versicolor  |
Virginica   |
Total       |

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Eigenvalues 


32.1919 0.2854 


Canonical Correlations 


0.9848 0.4712 


Cumulative Proportion of Total Dispersion 


0.9912 1.0000 


Canonical Scores of Group Means

            |        1         2
Setosa      |   7.6076    0.2151
Versicolor  |  -1.8250   -0.7279
Virginica   |  -5.7826    0.5128


Canonical Scores Plot 


[Figure: canonical scores plot of FACTOR(2) against FACTOR(1), with confidence ellipses for the SPECIES groups Setosa, Versicolor, and Virginica]


Group Frequencies 


The Group frequencies panel shows the count of flowers within each group and the 
means for each variable. If the group code or one or more measures are missing, the 


case is not used in the analysis. 


Between Groups F-Matrix 


For each pair of groups, use these F-ratio values to test the equality of group means. These values are proportional to distance measures and are computed from Mahalanobis D² statistics. Thus, the centroids for Versicolor and Virginica are closest (105.3); those for Setosa and Virginica (1098.3) are farthest apart. If you explore differences among several pairs, do not use the probabilities associated with these F's as a test because of the simultaneous inference problem. Compare the relative size of these values with the distances between group means in the canonical variable plot.
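
The proportionality between a pairwise F and the squared Mahalanobis distance can be sketched as follows. The conversion factor is the standard two-group formula from the multivariate literature and is an assumption here, since the manual does not print it. The function name is illustrative only, and the group sizes of 50 are taken from the rows of the classification matrix.

```python
def pairwise_f_to_d2(f_value, n_i, n_j, n_total, n_groups, n_vars):
    """Convert a between-groups F to a squared Mahalanobis distance.

    Assumed relation (not stated in the manual):
        F = [(N - g - p + 1) / (p * (N - g))] * [n_i * n_j / (n_i + n_j)] * D^2
    """
    df_scale = (n_total - n_groups - n_vars + 1) / (n_vars * (n_total - n_groups))
    size_scale = (n_i * n_j) / (n_i + n_j)
    return f_value / (df_scale * size_scale)


# Iris data: 150 flowers, 3 groups of 50, 4 variables in the model.
d2_vers_virg = pairwise_f_to_d2(105.3127, 50, 50, 150, 3, 4)
d2_set_virg = pairwise_f_to_d2(1098.2738, 50, 50, 150, 3, 4)
print(round(d2_vers_virg, 2), round(d2_set_virg, 2))
```

Because every group has the same size in these data, the ordering of the F values is the same as the ordering of the distances.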


F Statistics and Tolerance 


Use F-to-remove statistics to determine the relative importance of variables included in the model. The numerator degrees of freedom (df) for each F is the number of groups minus 1, and the denominator df is (total sample size) - (number of groups) - (number of variables in the model) + 1; for example, for these data, 3 - 1 and 150 - 3 - 4 + 1, or 2 and 144. Because you may be scanning F's for several variables, do not use the probabilities from the usual F tables for a test. Here we conclude that SEPALLEN is least helpful for discriminating among the species (F = 4.72).
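
The degrees-of-freedom rule above is simple arithmetic. A minimal sketch (the function name is hypothetical) makes it easy to check for other sample sizes:

```python
def f_to_remove_df(n_cases, n_groups, n_vars_in_model):
    """Degrees of freedom for an F-to-remove statistic, as described in the text."""
    numerator = n_groups - 1
    denominator = n_cases - n_groups - n_vars_in_model + 1
    return numerator, denominator


# Iris example: 150 cases, 3 groups, 4 variables in the model.
print(f_to_remove_df(150, 3, 4))
```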


Classification Tables 


In the Classification matrix, each case is classified into the group where the value of 
its classification function is the largest. For Versicolor (row name), 48 flowers are 
classified correctly and 2 are misclassified (classified as Virginica)—96% of the 
Versicolor flowers are classified correctly. Overall, 98% of the flowers are classified 
correctly (see the last row of the table). The results in the first table can be misleading 
because we evaluated the classification rule using the same cases used to compute it. 
They may provide an overly optimistic estimate of the rule's success. The Jackknifed
classification matrix attempts to remedy the problem by using functions computed 
from all of the data except the case being classified. The method of leaving out one case 
at a time is called the jackknife and is one form of cross-validation. 
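
The leave-one-out idea can be illustrated with a small sketch. To stay self-contained it uses a nearest-centroid rule as a stand-in for the classification functions, which is a simplification and not SYSTAT's computation. The data values and function names are made up for illustration.

```python
def centroid(points):
    """Mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def nearest_group(x, centroids):
    """Return the group whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda g: sq_dist(x, centroids[g]))


def accuracy(cases, leave_one_out):
    """Proportion classified correctly, optionally omitting each case from its own rule."""
    correct = 0
    for i, (x, group) in enumerate(cases):
        training = [c for j, c in enumerate(cases) if not (leave_one_out and j == i)]
        centroids = {}
        for g in {grp for _, grp in training}:
            centroids[g] = centroid([p for p, grp in training if grp == g])
        if nearest_group(x, centroids) == group:
            correct += 1
    return correct / len(cases)


# Hypothetical one-variable data with two groups.
cases = [((1.0,), "A"), ((2.0,), "A"), ((3.0,), "A"), ((5.0,), "A"),
         ((5.5,), "B"), ((7.0,), "B"), ((8.0,), "B"), ((9.0,), "B")]
resub = accuracy(cases, leave_one_out=False)      # rule evaluated on its own cases
jackknifed = accuracy(cases, leave_one_out=True)  # each case left out in turn
print(resub, jackknifed)
```

In this toy data the borderline case at 5.0 is classified correctly only when it helps define its own centroid, so the jackknifed rate is lower than the first rate.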

For these data, the results are the same. If the percentage for correct classification is 
considerably lower in the Jackknifed panel than in the first matrix, you may have too 


Eigenvalues, Canonical Correlations, Cumulative Proportion of Total Dispersion, 
and Canonical Scores of Group Means 


The first canonical variable is the linear combination of the variables that best discriminates among the groups, the second canonical variable is orthogonal to the first and is the next best combination of variables, and so on. For our data, the first eigenvalue (32.191) is very large relative to the second, indicating that the first canonical variable captures most of the difference among the groups. Notice that it accounts for more than 99% of the total dispersion of the groups (in the "Cumulative Proportion of Total Dispersion").
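
The three panels are tied together by simple formulas. The sketch below assumes the standard relation between an eigenvalue and its canonical correlation, r = sqrt(eigenvalue / (1 + eigenvalue)), which the manual does not state explicitly, and reproduces the printed values from the two eigenvalues.

```python
import math

eigenvalues = [32.1919, 0.2854]  # values printed in the Eigenvalues panel

# Assumed relation: canonical correlation = sqrt(ev / (1 + ev)).
canonical_corr = [math.sqrt(ev / (1.0 + ev)) for ev in eigenvalues]

# Cumulative proportion of total dispersion: running sum of eigenvalues over their total.
total = sum(eigenvalues)
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total)

print([round(r, 4) for r in canonical_corr])
print([round(c, 4) for c in cumulative])
```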

The Canonical correlation between the first canonical variable and a set of two dummy variables representing the groups is 0.985; the correlation between the second canonical variable and the dummy variables is 0.471. (The number of dummy variables is the number of groups minus 1.) Finally, the canonical variables are evaluated at the group means. That is, in the canonical variable plot, the centroid for the Setosa flowers is (7.608, 0.215), Versicolor is (-1.825, -0.728), and so on, where the first canonical variable is the x coordinate and the second, the y coordinate.


Canonical Scores Plot 


The axes of this Quick Graph are the first two canonical variables, and the points are 
the canonical variable scores. The confidence ellipses are centered on the centroid of 
each group. The Setosa flowers are well differentiated from the others. There is some 
overlap between the other two groups. Look for outliers in these displays because they 


can affect your analysis. 


Example 2
Discriminant Analysis Using Automatic Forward Stepping


Our problem for this example is to derive a rule for classifying countries as European, Islamic, or New World. We know that strong correlations exist among the candidate predictor variables, so we are curious about just which subset will be useful. Here are the candidate predictors:

URBAN       Percentage of the population living in cities
BIRTH_RT    Births per 1000 people in 1990
DEATH_RT    Deaths per 1000 people in 1990
B_TO_D      Ratio of births to deaths in 1990
BABYMORT    Infant deaths during the first year per 1000 live births
GDP_CAP     Gross domestic product per capita (in U.S. dollars)
LIFEEXPM    Years of life expectancy for males
LIFEEXPF    Years of life expectancy for females
EDUC        U.S. dollars spent per person on education in 1986
HEALTH      U.S. dollars spent per person on health in 1986
MIL         U.S. dollars spent per person on the military in 1986
LITERACY    Percentage of the population who can read


Because the distributions of the economic variables are skewed with long right tails, we log transform GDP_CAP and take the square root of EDUC, HEALTH, and MIL.

LET GDP_CAP = L10(GDP_CAP)
LET EDUC = SQR(EDUC)
LET HEALTH = SQR(HEALTH)
LET MIL = SQR(MIL)

Alternatively, we could also use the shortcut notation to request the square root transformations:

LET (EDUC, HEALTH, MIL) = SQR(@)
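
For readers working outside SYSTAT, the same transformations can be sketched in Python. The function name and the input values are hypothetical; only the choice of a base-10 log for GDP_CAP and square roots for EDUC, HEALTH, and MIL comes from the text.

```python
import math


def transform_case(gdp_cap, educ, health, mil):
    """Apply the transformations used in this example to one case."""
    return {
        "GDP_CAP": math.log10(gdp_cap),  # corresponds to L10(GDP_CAP)
        "EDUC": math.sqrt(educ),         # corresponds to SQR(EDUC)
        "HEALTH": math.sqrt(health),
        "MIL": math.sqrt(mil),
    }


# Hypothetical raw values for a single country.
example = transform_case(gdp_cap=1000.0, educ=16.0, health=25.0, mil=9.0)
print(example)
```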


We use automatic forward stepping in an effort to identify the best subset of predictors. After stepping stops, we need to type STOP to ask SYSTAT to produce the summary table, classification matrices, and information about canonical variables.

The input is:

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DISCRIM 
USE OURWORLD
LET GDP_CAP = L10(GDP_CAP)
LET (EDUC, HEALTH, MIL) = SQR(@)
LABEL GROUP / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL GROUP = URBAN BIRTH_RT DEATH_RT BABYMORT,
              GDP_CAP EDUC HEALTH MIL B_TO_D,
              LIFEEXPM LIFEEXPF LITERACY / PRIORS=SAMPLE
PLENGTH SHORT / MEANS CLASS JCLASS
START / FORWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
STOP


Notice that the initial results appear after START / FORWARD is specified. STEP / AUTO 
and STOP are selected later. 


The output is: 


Group Frequencies

            |    Europe    Islamic   NewWorld
URBAN       |   68.7895    30.0667    56.3810
BIRTH_RT    |    ?.5789    42.7333    26.9524
DEATH_RT    |    ?.1053    13.4000     7.4762
BABYMORT    |    ?.8947   102.3333    42.8095
GDP_CAP     |    ?.0431     2.7640     3.2139
EDUC        |    ?.5275     6.4156     8.9619
HEALTH      |    ?.9537     3.1937     6.8898
MIL         |    ?.9751     7.5431     6.0903
B_TO_D      |    ?.2658     3.5472     3.9509
LIFEEXPM    |    ?.3684    54.4000    66.6190
LIFEEXPF    |    ?.5263    57.1333    71.5714
LITERACY    |   97.5263    36.7333    79.9571

Variable        F-to-enter
 6 URBAN           23.1981
 8 BIRTH_RT       103.4953
10 DEATH_RT        14.4091
12 BABYMORT        53.6151
16 GDP_CAP         59.1163
19 EDUC            27.1207
21 HEALTH          49.6216
23 MIL             19.2950
34 B_TO_D          31.5359
30 LIFEEXPM        37.0774
31 LIFEEXPF        50.2980
32 LITERACY        63.6450

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Using commands, type STEP/AUTO. 


The output is: 
**************** Step 1 -- Variable BIRTH_RT Entered ****************

Between Groups F-matrix    df : 1 52

          |    Europe    Islamic   NewWorld
Europe    |    0.0000
Islamic   |  206.5877     0.0000
NewWorld  |   55.8562    59.0625     0.0000

Variable      F-to-remove  Tolerance | Variable      F-to-enter  Tolerance
 8 BIRTH_RT      103.4953     1.0000 |  6 URBAN          1.2603     0.7246
                                     | 10 DEATH_RT      19.4059     0.6861
                                     | 12 BABYMORT       2.1266     0.4438
                                     | 16 GDP_CAP        4.5571     0.5814
                                     | 19 EDUC           5.1239     0.8314
                                     | 21 HEALTH         9.5179     0.8686
                                     | 23 MIL            8.5505     0.9075
                                     | 34 B_TO_D        14.9368     0.9880
                                     | 30 LIFEEXPM       4.3098     0.4379
                                     | 31 LIFEEXPF       3.5780     0.3716
                                     | 32 LITERACY      10.3191     0.3246

**************** Step 2 -- Variable DEATH_RT Entered ****************

Between Groups F-matrix    df : 2 51

          |    Europe    Islamic   NewWorld
Europe    |    0.0000
Islamic   |  120.1297     0.0000
NewWorld  |   59.7595    29.7661     0.0000

Variable      F-to-remove  Tolerance | Variable      F-to-enter  Tolerance
 8 BIRTH_RT      118.4057            |  6 URBAN          0.0695     0.6944
10 DEATH_RT       19.4059     0.6861 | 12 BABYMORT       1.8295     0.2796
                                     | 16 GDP_CAP        7.8755     0.5208
                                     | 19 EDUC           5.0296     0.8126
                                     | 21 HEALTH         6.4710     0.8642
                                     | 23 MIL           13.2124     0.7896
                                     | 34 B_TO_D         0.8163     0.1861
                                     | 30 LIFEEXPM       3.3363     0.1582
                                     | 31 LIFEEXPF       5.1954     0.1205
                                     | 32 LITERACY       2.2239     0.2653

**************** Step 3 -- Variable MIL Entered ****************

Between Groups F-matrix    df : 3 50

          |    Europe    Islamic   NewWorld
Europe    |    0.0000
Islamic   |   80.7600     0.0000
NewWorld  |   55.6502    24.6740     0.0000

Variable      F-to-remove  Tolerance | Variable      F-to-enter  Tolerance
 8 BIRTH_RT       77.8483     0.6831 |  6 URBAN
10 DEATH_RT       25.3945     0.5969 | 12 BABYMORT
23 MIL            13.2124     0.7896 | 16 GDP_CAP
                                     | 19 EDUC
                                     | 21 HEALTH
                                     | 34 B_TO_D
                                     | 30 LIFEEXPM
                                     | 31 LIFEEXPF
                                     | 32 LITERACY

When using commands, type STOP.

The output is:

Stepping Summary

Variable  | F(+ent,-rem)  Wilks's Lambda  Approx. F-ratio  p-value
BIRTH_RT  |     103.4953          0.2008         103.4953   0.0000
DEATH_RT  |      19.4059          0.1140          50.0200   0.0000
MIL       |      13.2124          0.0746          44.3576   0.0000

Classification Matrix (Cases in row categories classified into columns)

          |  Europe  Islamic  NewWorld  %correct
Europe    |      19        0         0       100
Islamic   |       0       13         2        87
NewWorld  |       2        1        18        86
Total     |      21       14        20        91

Jackknifed Classification Matrix

          |  Europe  Islamic  NewWorld  %correct
Europe    |      19        0         0       100
Islamic   |       0       12         3        80
NewWorld  |       2        1        18        86
Total     |      21       13        21        89

Eigenvalues

Canonical Correlations

0.9165 0.7308

Cumulative Proportion of Total Dispersion

0.8207 1.0000

Canonical Scores of Group Means

          |        1         2
Europe    |  -2.9381    0.4091
Islamic   |   2.4807    1.2431
NewWorld  |   0.8864   -1.2581
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Canonical Scores Plot 


[Figure: canonical scores plot of FACTOR(2) against FACTOR(1), with confidence ellipses for the GROUP categories Europe, Islamic, and NewWorld]


Group Frequencies and Means 


From the panel of Group means, note that, on the average, the percentage of the 
population living in cities (URBAN) is 68.8% in Europe, 30.1% in Islamic nations, and 
56.4% in the New World. The LITERACY rates for these same groups are 97.5%, 
36.7%, and 80.0%, respectively. 


Steps 


After the group means, you will find the F-to-enter statistics for each variable not in the functions. When no variables are in the model, each F is the same as that for a one-way analysis of variance. Thus, group differences are the strongest for BIRTH_RT (F = 103.5) and weakest for DEATH_RT (F = 14.41). At later steps, each F corresponds to the F for a one-way analysis of covariance where the covariates are the variables already in the model.

At step 1, SYSTAT enters BIRTH_RT because its F-to-enter is the largest in the last panel and now displays the same F in the F-to-remove panel. BIRTH_RT is correlated with several candidate variables, so notice how their F-to-enter values drop when BIRTH_RT enters (for example, for GDP_CAP, from 59.1 to 4.6). DEATH_RT now has the highest F-to-enter, so SYSTAT will enter it at step 2. From the between-groups F-matrix, note that when BIRTH_RT is used alone, Europe and Islamic countries are the groups that differ the most (206.6), and Europe and the New World are the groups that differ the least (55.9).




After DEATH_RT enters, the F-to-enter for MIL (money spent per person on the military) is the largest, so SYSTAT enters it at step 3. The SYSTAT default limit for F-to-enter values is 4.0. No variable has an F-to-enter above the limit, so the stepping stops. Also, all F-to-remove values are greater than 3.9, so no variables are removed.
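
The decision made at each automatic step can be sketched as below. This is an illustration of the F-based entry and removal limits, not SYSTAT's implementation; in particular, checking removal before entry is an assumption, and the function name is hypothetical. The F values are those printed after step 2.

```python
def next_forward_step(f_to_enter, f_to_remove, fenter=4.0, fremove=3.9):
    """Return ('remove', var), ('enter', var), or ('stop', None)."""
    if f_to_remove:
        weakest = min(f_to_remove, key=f_to_remove.get)
        if f_to_remove[weakest] < fremove:   # below the Remove limit
            return ("remove", weakest)
    if f_to_enter:
        strongest = max(f_to_enter, key=f_to_enter.get)
        if f_to_enter[strongest] > fenter:   # above the Enter limit
            return ("enter", strongest)
    return ("stop", None)


# State after step 2 of the forward-stepping output.
in_model = {"BIRTH_RT": 118.4057, "DEATH_RT": 19.4059}
candidates = {"URBAN": 0.0695, "BABYMORT": 1.8295, "GDP_CAP": 7.8755,
              "EDUC": 5.0296, "HEALTH": 6.4710, "MIL": 13.2124,
              "B_TO_D": 0.8163, "LIFEEXPM": 3.3363, "LIFEEXPF": 5.1954,
              "LITERACY": 2.2239}
decision = next_forward_step(candidates, in_model)
print(decision)

# Hypothetical later state in which every candidate is below the Enter limit.
final = next_forward_step({"URBAN": 1.3, "LITERACY": 3.5},
                          {"BIRTH_RT": 77.8483, "DEATH_RT": 25.3945, "MIL": 13.2124})
print(final)
```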

The summary table contains one row for each variable moved into the model. The 
F-to-enter (F-to-remove) is printed for each, along with Wilks's lambda and its 
approximate F-ratio, and p-values. 


Classification Matrices 


After the summary table, SYSTAT prints the classification matrices. From the biased estimate in the first matrix, our three-variable rule classifies 91% of the countries correctly. For the jackknifed results, this percentage drops to 89%. All of the European nations are classified correctly (100%), while almost one-fourth of the New World countries are misclassified (two as Europe and three as Islamic). These countries can be identified by using MAHAL, which prints the posterior probability for each case belonging to each group. You will find, for example, that Canada is misclassified as European and that Malaysia and Turkey are misclassified as New World.
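
The percentages quoted from a classification matrix are row-wise and overall hit rates. A minimal sketch, using the counts from the first (non-jackknifed) matrix of this example and a hypothetical function name:

```python
groups = ["Europe", "Islamic", "NewWorld"]
# Rows are actual groups, columns are predicted groups (first classification matrix).
matrix = [[19, 0, 0],
          [0, 13, 2],
          [2, 1, 18]]


def percent_correct(matrix):
    """Per-group and overall percentage of cases on the diagonal."""
    per_group = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total_correct = sum(matrix[i][i] for i in range(len(matrix)))
    total_cases = sum(sum(row) for row in matrix)
    return per_group, 100.0 * total_correct / total_cases


per_group, overall = percent_correct(matrix)
for name, pct in zip(groups, per_group):
    print(name, round(pct))
print("Total", round(overall))
```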


Canonical Results 


If you focus on the canonical results, you notice that the first canonical variable accounts for 82.1% of the dispersion, and in the Canonical scores of group means panel, the groups are ordered from left to right: Europe, New World, and then Islamic. The second canonical variable contrasts Islamic versus New World (1.243 versus -1.258).


Canonical Variable Plot 


In the canonical variable plot, the European nations (on the left) are well separated from the other groups. The plus sign (+) next to the European confidence ellipse is Canada. If you are unsure about which ellipse corresponds to what group, look at the Canonical scores of group means.


Example 3
Discriminant Analysis Using Automatic Backward Stepping


It is possible that classification rules for other subsets of the variables perform better 
than those found using forward stepping—especially when there are correlations 
among the variables. We then try backward stepping. 


The input is: 


DISCRIM
USE OURWORLD
LET GDP_CAP = L10(GDP_CAP)
LET (EDUC, HEALTH, MIL) = SQR(@)
LABEL GROUP / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL GROUP = URBAN BIRTH_RT DEATH_RT BABYMORT,
              GDP_CAP EDUC HEALTH MIL B_TO_D,
              LIFEEXPM LIFEEXPF LITERACY / PRIORS=19 15 21
PLENGTH SHORT / CFUNC CLASS JCLASS
IDVAR COUNTRY$
START / BACKWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
PLENGTH / TRACES CDFUNC SCDFUNC
STOP


Notice that we request STEP after an initial report and PLENGTH and STOP later. 


The output is: 


Between Groups F-matrix 
df : 12 41 


Europe Islamic NewWorld 


Europe 0.0000 
Islamic 25.3059 0.0000 
NewWorld 18.0596 7.3754 0.0000 


Classification Functions 


Europe Islamic NewWorld 


URBAN        -2.4175    -2.3572    -2.2871
BIRTH_RT     41.9790    43.1675    43.1322
DEATH_RT     50.0202    48.1539    48.1950
BABYMORT      9.3190     9.3806     9.3461
GDP_CAP     243.6686   234.5165   237.0805
EDUC          2.0078     4.0450     3.4276
HEALTH      -17.9706   -19.8527   -19.3068
MIL          -9.8420   -10.1746   -10.6076
B_TO_D      759.6547   762.2446   761.8195
LIFEEXPM     -9.8216    -9.1537    -9.4952
LIFEEXPF     93.5933    93.0934    93.4108
LITERACY      7.5909     7.5834     7.7178

Variable      F-to-remove  Tolerance | Variable  F-to-enter  Tolerance
 6 URBAN           2.1673     0.4365 |
 8 BIRTH_RT        2.0115     0.0596 |
10 DEATH_RT        2.2580     0.0915 |
12 BABYMORT        0.1032     0.0840 |
16 GDP_CAP         0.6218     0.1435 |
19 EDUC            6.1198     0.0651 |
21 HEALTH          5.3565     0.0832 |
23 MIL             7.1137     0.3235 |
34 B_TO_D          0.5532     0.1361 |
30 LIFEEXPM        0.2617     0.0361 |
31 LIFEEXPF        0.0675     0.0123 |
32 LITERACY        1.4456     0.1778 |


Using commands, type STEP/AUTO. 
**************** Step 1 -- Variable LIFEEXPF Removed ****************

Between Groups F-matrix    df : 11 42

          |    Europe    Islamic   NewWorld
Europe    |    0.0000
Islamic   |   28.2000     0.0000
NewWorld  |   20.1693     8.2086     0.0000

Variable      F-to-remove  Tolerance | Variable      F-to-enter  Tolerance
 6 URBAN           2.4479     0.4662 | 31 LIFEEXPF       0.0675     0.0123
 8 BIRTH_RT        3.0369     0.0775 |
10 DEATH_RT        2.4523     0.1007 |
12 BABYMORT        0.4111     0.1406 |
16 GDP_CAP         0.6778     0.1449 |
19 EDUC            6.7110     0.0665 |
21 HEALTH          6.7824     0.0921 |
23 MIL             7.3879     0.3289 |
34 B_TO_D          0.7029     0.1480 |
30 LIFEEXPM        0.2372     0.0778 |
32 LITERACY        1.4767     0.1855 |

(We omit the output for steps 2 through 6.)

**************** Step 7 -- Variable URBAN Removed ****************

Between Groups F-matrix    df : 5 48

          |    Europe    Islamic   NewWorld
Europe    |    0.0000
Islamic   |   61.5899     0.0000
NewWorld  |   40.9350    15.6004     0.0000

Variable      F-to-remove  Tolerance | Variable      F-to-enter  Tolerance
 8 BIRTH_RT       27.8886     0.6227 |
10 DEATH_RT       15.5101     0.5834 | 12 BABYMORT       1.1222     0.2437
19 EDUC            5.2022     0.0839 | 16 GDP_CAP        1.2025     0.1712
21 HEALTH          6.6659     0.1025 | 34 B_TO_D         1.2428     0.1803
23 MIL             7.4236     0.5010 | 30 LIFEEXPM       0.0157     0.1236
                                     | 31 LIFEEXPF       0.4856     0.0760
                                     | 32 LITERACY       3.4249     0.2503
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Classification Matrix (Cases in row categories classified into columns) 


          |  Europe  Islamic  NewWorld  %correct
Europe    |      19        0         0       100
Islamic   |       0       13         2        87
NewWorld  |       1        1        19        90
Total     |      20       14        21        93

Jackknifed Classification Matrix

          |  Europe  Islamic  NewWorld  %correct
Europe    |      19        0         0       100
Islamic   |       0       13         2        87
NewWorld  |       1        2        18        86
Total     |      20       15        20        91

Stepping Summary

Variable  | F(+ent,-rem)  Wilks's Lambda  Approx. F-ratio  p-value
LIFEEXPF  |      -0.0675          0.0405          15.1458   0.0000
LIFEEXPM  |      -0.2372          0.0410          16.9374   0.0000
BABYMORT  |      -0.2190          0.0414          19.1350   0.0000
B_TO_D    |      -0.8485          0.0430          21.4980   0.0000
GDP_CAP   |      -1.4294          0.0457          24.1542   0.0000
LITERACY  |      -2.3877          0.0505          27.0277   0.0000
URBAN     |      -3.6548          0.0583          30.1443   0.0000

Eigenvalues

6.9840 1.1468

Canonical Correlations

0.9353 0.7309

Cumulative Proportion of Total Dispersion

0.8590 1.0000

Using commands, type PLENGTH / TRACES CDFUNC SCDFUNC, then STOP.

Test Statistic

Statistic               |   Value  Approx. F-ratio  p-value
Wilks's Lambda          |
Pillai's Trace          |
Lawley-Hotelling Trace  |  8.1308          38.2148   0.0000

Canonical Discriminant Functions

          |        1         2
Constant  |  -1.9836   -5.4022
URBAN     |        .         .
BIRTH_RT  |   0.1603    0.0414
DEATH_RT  |  -0.1588    0.2771
BABYMORT  |        .         .
GDP_CAP   |        .         .
EDUC      |   0.2358    0.0063
HEALTH    |  -0.2604   -0.0015
MIL       |  -0.0736    0.1497
B_TO_D    |        .         .
LIFEEXPM  |        .         .
LIFEEXPF  |        .         .
LITERACY  |        .         .

Canonical Discriminant Functions : Standardized by Within Variances

          |        1         2
URBAN     |        .         .
BIRTH_RT  |   0.9737    0.2512
DEATH_RT  |  -0.5188    0.9050
BABYMORT  |        .         .
GDP_CAP   |        .         .
EDUC      |   1.5574    0.0413
HEALTH    |  -1.5572   -0.0091
MIL       |  -0.3910    0.7952
B_TO_D    |        .         .
LIFEEXPM  |        .         .
LIFEEXPF  |        .         .
LITERACY  |        .         .

Canonical Scores of Group Means

          |        1         2
Europe    |  -3.3891    0.4103
Islamic   |   2.8644    1.2426
NewWorld  |   1.0203   -1.2588


Canonical Scores Plot 


[Figure: canonical scores plot of FACTOR(2) against FACTOR(1), with confidence ellipses for the GROUP categories Europe, Islamic, and NewWorld]

Classification Function 


Before stepping starts, SYSTAT uses all candidate variables to compute classification functions. The output includes the coefficients for these functions used to classify cases into groups. A variable is omitted only if it fails the Tolerance limit. For each case, SYSTAT computes three functions. The first is:

-4408.365 - 2.417*urban + 41.979*birth_rt + ... + 7.591*literacy

Each case is assigned to the group with the largest value.
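
The assignment rule can be sketched as follows. The constants and coefficients below are hypothetical two-variable values chosen for illustration; they are not the functions printed in the output, and the function name is also illustrative.

```python
def classify(case, functions):
    """Evaluate each linear classification function and pick the largest score."""
    scores = {}
    for group, (constant, coefs) in functions.items():
        scores[group] = constant + sum(coefs[v] * case[v] for v in coefs)
    return max(scores, key=scores.get), scores


# Hypothetical classification functions: (constant, {variable: coefficient}).
functions = {
    "Europe": (-10.0, {"BIRTH_RT": 0.2, "DEATH_RT": 1.0}),
    "Islamic": (-25.0, {"BIRTH_RT": 0.8, "DEATH_RT": 0.6}),
}

low_birth, _ = classify({"BIRTH_RT": 13.0, "DEATH_RT": 10.0}, functions)
high_birth, _ = classify({"BIRTH_RT": 43.0, "DEATH_RT": 13.0}, functions)
print(low_birth, high_birth)
```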

Tolerance measures the correlation of a candidate variable with the variables included in the model, and its values range from 0 to 1.0. If a variable is highly correlated with one or more of the others, the value of Tolerance is very small and the resulting estimates of the discriminant function coefficients may be very unstable. To avoid a loss of accuracy in the matrix inversion computations, you should rarely set the value of this limit to a lower value (the default is 0.001). LIFEEXPF, female life expectancy, has a very low Tolerance value, so it may be redundant or highly correlated with another variable or a linear combination of other variables. The Tolerance value of LIFEEXPM, male life expectancy, is also low; these two measures of life expectancy may be highly correlated with one another. Notice also that the value for BIRTH_RT is very low (0.0596) and its F-to-remove value is 2.0115; its F-to-enter at step 0 in the forward stepping example was 103.5.
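
As a rough illustration of why high correlation drives Tolerance toward zero, the sketch below assumes the usual definition, Tolerance = 1 - R², and handles the simplest case of one variable already in the model, where R² is the squared correlation. It ignores the pooled within-group adjustment a discriminant analysis would apply, and the data are made up.

```python
def tolerance_one_predictor(candidate, in_model):
    """Tolerance of a candidate against a single variable: 1 - r**2 (assumed definition)."""
    n = len(candidate)
    mean_c = sum(candidate) / n
    mean_m = sum(in_model) / n
    sxy = sum((c - mean_c) * (m - mean_m) for c, m in zip(candidate, in_model))
    sxx = sum((c - mean_c) ** 2 for c in candidate)
    syy = sum((m - mean_m) ** 2 for m in in_model)
    r_squared = sxy ** 2 / (sxx * syy)
    return 1.0 - r_squared


# Hypothetical values for two correlated measures.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
tol = tolerance_one_predictor(x, y)
print(round(tol, 2))
```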

At step 7, no variable has an F-to-remove value less than 3.9, so the stepping stops. The final model found by backward stepping includes five variables: BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL. We are not happy, however, with the low Tolerance values for two of these variables. The model found via automatic forward stepping did not include EDUC or HEALTH (their F-to-enter statistics at step 3 are 0.01 and 1.24, respectively). URBAN and LITERACY appear more likely candidates, but their F's are still less than 4.0.


Classification Matrices 


In the classification matrices, 93% and 91% of the countries are classified correctly using the five-variable discriminant functions. This is a slight improvement over the three-variable model, which gave 91% for the first matrix and 89% for the jackknifed results. The improvement from 91% to 93% in the first matrix and 89% to 91% in the jackknifed matrix is because one New World country is now classified correctly in the first matrix and one Islamic country is classified correctly in the Jackknifed classification matrix. We add two variables and gain two correct classifications.

Wilks's lambda (or U statistic), a multivariate analysis of variance statistic that varies between 0 and 1, tests the equality of group means for the variables in the discriminant functions. Wilks's lambda is transformed to an approximate F-ratio for comparison with the F distribution. Here, the associated probability is less than 0.00005, indicating a highly significant difference among the groups. The Lawley-Hotelling trace and its F approximation are documented in Morrison (2004). When there are only two groups, it is seen to be equivalent to the Wilks's lambda. Pillai's trace and its F approximation are taken from Pillai (1960).
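
All three test statistics can be written in terms of the eigenvalues. The sketch below uses the standard textbook formulas, which are an assumption here because the manual does not print them, together with the two eigenvalues from this example. The results agree with the final Wilks's lambda in the stepping summary (0.0583) and the printed Lawley-Hotelling trace (8.1308).

```python
eigenvalues = [6.9840, 1.1468]  # values printed in the Eigenvalues panel

# Assumed standard formulas:
#   Wilks's lambda         = product of 1 / (1 + ev)
#   Pillai's trace         = sum of ev / (1 + ev)
#   Lawley-Hotelling trace = sum of ev
wilks = 1.0
for ev in eigenvalues:
    wilks *= 1.0 / (1.0 + ev)
pillai = sum(ev / (1.0 + ev) for ev in eigenvalues)
lawley_hotelling = sum(eigenvalues)

print(round(wilks, 4), round(pillai, 4), round(lawley_hotelling, 4))
```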

The canonical discriminant functions list the coefficients of the canonical variables computed first for the data as input and then for the standardized values. For the unstandardized data, the first canonical variable is:

-1.984 + 0.160*BIRTH_RT - 0.159*DEATH_RT + 0.236*EDUC
- 0.260*HEALTH - 0.074*MIL

The coefficients are adjusted so that the overall mean of the corresponding scores is 0 and the pooled within-group variances are 1. After standardizing, the first canonical variable is:

0.974*BIRTH_RT - 0.519*DEATH_RT + 1.557*EDUC
- 1.557*HEALTH - 0.391*MIL
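
One way to see how the two sets of coefficients relate is to divide each standardized coefficient by its unstandardized counterpart. If standardizing multiplies a coefficient by the variable's pooled within-group standard deviation (an assumption suggested by the panel title "Standardized by Within Variances"), the ratio should be the same for both canonical variables. The sketch below checks this for three of the variables, using values from the two panels.

```python
# (first function, second function) coefficients from the output panels.
raw = {"BIRTH_RT": (0.1603, 0.0414),
       "DEATH_RT": (-0.1588, 0.2771),
       "MIL": (-0.0736, 0.1497)}
standardized = {"BIRTH_RT": (0.9737, 0.2512),
                "DEATH_RT": (-0.5188, 0.9050),
                "MIL": (-0.3910, 0.7952)}

# Implied within-group standard deviation from each canonical variable.
implied_sd = {v: tuple(s / r for s, r in zip(standardized[v], raw[v])) for v in raw}
for name, (sd1, sd2) in implied_sd.items():
    print(name, round(sd1, 2), round(sd2, 2))
```

The two ratios agree for each variable up to rounding in the printed coefficients.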


Usually, one uses the latter set of coefficients to interpret what variables "drive" each canonical variable. Here, EDUC and HEALTH, the variables with low tolerance values, have the largest coefficients, and they appear to cancel one another. Also, in the final model, the size of their F-to-remove values indicates that they are the least useful variables in the model. This indicates that we do not have an optimum set of variables. These two variables contribute little alone, while together they enhance the separation of the groups. This suggests that the difference between EDUC and HEALTH could be a useful variable (for example, LET DIFF = EDUC - HEALTH). We did this, and the following is the first canonical variable for standardized values (we omit the constant):

1.024*BIRTH_RT - 0.539*DEATH_RT - 0.480*MIL + 0.553*DIFF

From the Canonical scores of group means for the first canonical variable, the groups line up with Europe first, then New World in the middle, and Islamic on the right. In the second dimension, DEATH_RT and MIL (military expenditures) appear to separate Islamic and New World countries.
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Mahalanobis Distances and Posterior Probabilities 


Even if you have already specified PLENGTH LONG, you must type PLENGTH / MAHAL to obtain Mahalanobis distances.


The output is: 


Mahalanobis Distance-Square from Group Means and
Posterior Probabilities for Group Membership


Priors       |      0.3455             0.2727             0.3818
Group        |      Europe             Islamic            NewWorld
Case         |  Squared  p-value   Squared  p-value   Squared  p-value
             |  Distance           Distance           Distance
-------------+----------------------------------------------------------
Group Europe
Ireland      |   3.4012   0.9966   34.0999   0.0000   14.9366   0.0034
Austria      |   3.3170   0.9999   36.4857   0.0000   21.8439   0.0001
Belgium*     |
Denmark      |   8.6984   0.9986   34.0831   0.0000   22.0257   0.0014
Finland      |   0.7080   1.0000   38.0924   0.0000   22.6469   0.0000
France       |   2.1628   1.0000   45.0838   0.0000   31.4423   0.0000
Greece       |   1.6437   1.0000   47.5839   0.0000   31.9330   0.0000
Switzerland  |  10.1933   1.0000   68.8096   0.0000   51.2757   0.0000
Spain        |   8.7530   0.9981   46.1779   0.0000   21.4402   0.0019
UK           |   1.8102   1.0000   43.7362   0.0000   32.1058   0.0000
Italy        |   1.9961   1.0000   45.4404   0.0000   29.1041   0.0000
Sweden       |   1.7667   1.0000   46.0446   0.0000   31.5799   0.0000
Portugal     |  10.1404   1.0000   50.0844   0.0000   38.9003   0.0000
Netherlands  |   3.4491   0.9999   42.1519   0.0000   23.0115   0.0001
WGermany     |   8.2209   1.0000   65.9707   0.0000   45.9244   0.0000
Norway       |   4.5822   1.0000   37.2658   0.0000   29.8054   0.0000
Poland       |   1.9868   0.9989   30.8275   0.0000   15.7328   0.0011
Hungary      |   4.2126   1.0000   42.6300   0.0000   29.0750   0.0000
EGermany     |   4.5978   1.0000   41.6454   0.0000   32.8865   0.0000
Czechoslov   |   0.6714   1.0000   42.2272   0.0000   27.3687   0.0000
Group Islamic
Gambia       |  43.7648   0.0000    3.0356   0.9980   16.1605   0.0020
Iraq         |  66.8520   0.0000   21.7182   0.9995   37.5738   0.0005
Pakistan     |  40.6335   0.0000    1.6579   0.9974   14.2374   0.0026
Bangladesh   |  38.1494   0.0000    3.2954   0.9927   13.8046   0.0073
Ethiopia     |  48.2829   0.0000   12.5872   0.8746   17.1449   0.1254
Guinea       |  41.070    0.0000    8.0488   0.9998   25.8050   0.0002
Malaysia     |  37.9854   0.0000   10.3800   0.8043   13.8800   0.1957
Senegal      |  46.1662   0.0000    2.1309   0.9917   12.3611   0.0083
Mali         |  49.9828   0.0000    6.1497   0.9998   24.1310   0.0002
Libya        |  63.3301   0.0000   18.0688   1.0000   39.7918   0.0000
Hoc tiara 5.6036 0,9997 22.7061 0.0003 
Sudan ' 43.7400 — 0.0000 0.4401 j 
Turkey»! 570722; 7344 ОЧ 6.6911 C EN 0:9324 
Algeria | 47.2061 0.0000 5.1183 0.7335 7.8162 0.2665 
Yemen 15727475 9.0000 ME NINE. "42/7092 021665 
Group NewWorld
Argentina i 20.3538 0.0008 28,2858 Е 
Barbados i 18.7361 0.0048 . 24,6776 0:0000 222863 019980 
Bolivia      |  30.5978   0.0000    7.2087   0.1144    3.7879   0.8856
Brazil       |  34.2783   0.0000   16.3385   0.0008    2.6703   0.9992
Canada-->    |   2.8691   0.9966   28.4995   0.0000   14.4075   0.0034
Chile        |  29.1842   0.0000   23.1567   0.0000    3.2342   1.0000
Colombia     |  42.8395   0.0000   21.2669   0.0001    3.2844   0.9999
CostaRica    |  41.5094   0.0000   28.2898   0.0000    7.3617   1.0000
Venezuela    |  52.3185   0.0000   21.4489   0.0005    6.8196   0.9995
DominicanR.  |  27.4923   0.0000   14.6973   0.0009    1.2608   0.9991
Uruguay      |  19.3954   0.0053   28.5303   0.0000    9.1114   0.9947
Ecuador      |  37.5375   0.0000   15.2378   0.0016    3.0869   0.9984
ElSalvador   |  35.0571   0.0000    8.2590   0.0500    3.0429   0.9500
Jamaica      |  27.6430   0.0000   20.1999   0.0011    7.3062   0.9988
Guatemala    |  39.0563   0.0000    4.9469   0.4041    4.8432   0.5959
Haiti-->     |  39.8653   0.0000    2.6281   0.9957   14.1992   0.0043
Honduras     |  39.9424   0.0000    6.5112   0.3511    5.9561   0.6489
Trinidad     |  43.0474   0.0000   20.9289   0.0009    7.5808   0.9991
Peru         |  21.9502   0.0000   13.3709   0.0017    1.2785   0.9983
Panama       |  26.5693   0.0000   19.9050   0.0002    3.3050   0.9998
Cuba         |  11.3560   0.0252   19.7852   0.0003    4.2421   0.9746


--> case misclassified
*   case not used in computation


For each case (up to 250 cases), the Mahalanobis distance squared (D²) is computed
for each group mean. The closer a case is to a particular mean, the more likely it
belongs to that group. The posterior probability for the distance of a case to a mean is
the ratio of EXP(-0.5 * D²) for the group divided by the sum of EXP(-0.5 * D²) for
all groups (prior probabilities, if specified, affect these computations).
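
The ratio described above can be sketched in a few lines of Python. This is only an illustration, not SYSTAT's own code: the function name `posterior_probabilities` is hypothetical, and the sketch assumes that each prior simply multiplies its group's EXP(-0.5 * D²) term before normalizing, which is the usual convention but is not spelled out in the text.

```python
import math

def posterior_probabilities(d2, priors=None):
    """Posterior group probabilities from squared Mahalanobis distances.

    d2     -- one squared distance per group
    priors -- one prior per group (assumed to multiply each term);
              equal priors are used when omitted
    """
    if priors is None:
        priors = [1.0 / len(d2)] * len(d2)
    # Shift by the smallest distance so exp() cannot underflow for every group.
    d_min = min(d2)
    weights = [p * math.exp(-0.5 * (d - d_min)) for d, p in zip(d2, priors)]
    total = sum(weights)
    return [w / total for w in weights]

# Ireland's distances to the Europe, Islamic, and NewWorld means,
# with the priors printed at the top of the panel.
ireland = posterior_probabilities(
    [3.4012, 34.0999, 14.9366], priors=[0.3455, 0.2727, 0.3818]
)
print([round(p, 4) for p in ireland])
```

Under these assumptions the Europe and NewWorld values for Ireland should land very close to the 0.9966 and 0.0034 printed in the panel.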

An arrow (-->) marks incorrectly classified cases, and an asterisk (*) flags cases with
missing values. New World countries Bolivia and Haiti are classified as Islamic, and
Canada is classified as Europe. Note that even though an asterisk marks Belgium, the
results are printed: only the value of the unused candidate variable URBAN is missing.
No results are printed for Afghanistan because MIL, a variable in the final model, is
missing.

You can identify cases with all large distances as outliers. A case can have a 
probability 1.0 of belonging to a particular group but still have a large distance. Look 
at Iraq. It is correctly classified as Islamic, but its distance is 23.5. The distances in this 
panel are distributed approximately as a chi-square with degrees of freedom equal to 
the number of variables in the function. 


Example 4
Discriminant Analysis Using Interactive Stepping 


Automatic forward and backward stepping can produce different sets of predictor 
variables, and other subsets of the variables may still perform equally well or possibly 
better. Here we use interactive stepping to explore alternative sets of variables. 


Using the OURWORLD data, let's say you decide not to include birth and death
rates in the model because the rates are changing rapidly for several nations (that is, we
omit these variables from the model). We also add the difference between EDUC and
HEALTH as a candidate variable.

SYSTAT provides several ways to specify which variables to move into (or out of)
the model.

The input is:
DISCRIM
USE OURWORLD
LET GDP_CAP = L10(GDP_CAP)
LET (EDUC, HEALTH, MIL) = SQR(@)
LET DIFFRNCE = EDUC - HEALTH
LABEL GROUP / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL GROUP = URBAN BIRTH_RT DEATH_RT BABYMORT,
GDP_CAP EDUC HEALTH MIL B_TO_D,
LIFEEXPM LIFEEXPF LITERACY DIFFRNCE
GRAPH NONE
START / BACK
After interpreting these commands and printing the output below, SYSTAT waits for 
us to enter STEP instructions. 
The output is: 
Between Groups F-matrix
df :     12     41

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  25.3059    0.0000
NewWorld  |  18.0596    7.3754    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
 6 URBAN         2.1673     0.4365   |  40 DIFFRNCE      0.0000     0.0000
 8 BIRTH_RT      2.0115     0.0596   |
10 DEATH_RT      2.2580     0.0915   |
12 BABYMORT      0.1032     0.0840   |
16 GDP_CAP       0.6218     0.1435   |
19 EDUC          6.1198     0.0651   |
21 HEALTH        5.3565     0.0832   |
23 MIL           7.1137     0.3235   |
34 B_TO_D        0.5532     0.1361   |
30 LIFEEXPM      0.2617     0.0361   |
31 LIFEEXPF      0.0675     0.0123   |
32 LITERACY      1.4456     0.1778   |
A summary of the STEP arguments (variable numbers are visible in the output)
follows:

a.  STEP BIRTH_RT DEATH_RT         Remove two variables
b.  STEP LIFEEXPF                  Remove one variable
c.  STEP -                         Remove LIFEEXPM
d.  STEP -                         Remove BABYMORT
e.  STEP -                         Remove URBAN
f.  STEP -                         Remove GDP_CAP
g.  STEP EDUC HEALTH DIFFRNCE      Remove EDUC and HEALTH; add DIFFRNCE
h.  STEP +                         Enter GDP_CAP
    PLENGTH / CLASS JCLASS
    SCDFUNC
    STOP


Notice that the seventh STEP specification (g) removes EDUC and HEALTH and 
enters DIFFRNCE. Remember, after the last step, to type STOP for the canonical 
variable results and other summaries. 


Steps 1 and 2


The input is:

STEP BIRTH_RT DEATH_RT


The output is:

Step 1 -- Variable BIRTH_RT Removed

Between Groups F-matrix
df :     11     42

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  26.3672    0.0000
NewWorld  |  18.0391    8.2404    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
 6 URBAN         2.6354     0.4379   |   8 BIRTH_RT      2.0115     0.0596
10 DEATH_RT      2.0024     0.0928   |  40 DIFFRNCE      0.0000     0.0000
12 BABYMORT      0.1371     0.0914   |
16 GDP_CAP       1.3957     0.1509   |
19 EDUC          5.9852     0.0658   |
21 HEALTH        4.2425     0.0909   |
23 MIL           5.9242     0.3850   |
34 B_TO_D        0.3454     0.3300   |
30 LIFEEXPM      0.4198     0.0365   |
31 LIFEEXPF      0.9612     0.0160   |
32 LITERACY      1.7886     0.2920   |

Step 2 -- Variable DEATH_RT Removed

Between Groups F-matrix
df :     10     43

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  27.8162    0.0000
NewWorld  |  18.1733    9.2794    0.0000

Variable      F-to-remove  Tolerance
-------------------------------------
 6 URBAN         2.1956     0.4525
12 BABYMORT      0.2250     0.1090
16 GDP_CAP       1.1364     0.1535
19 EDUC          6.5230     0.0659
21 HEALTH        6.2844     0.0935
23 MIL           6.6860     0.3854
34 B_TO_D        6.4808     0.6519
30 LIFEEXPM      0.5111     0.0366
31 LIFEEXPF      0.2752     0.0192
32 LITERACY      1.8929     0.3124
Step 3 
The input is: 


STEP LIFEEXPF 


The output is:


THERE ТАРИ 


Step 3 


Between Groups F-matrix 
df : 9 44 
| Europe а 
па аи wf we ы Илнурга ИИ, 
Europe ! 0.0000 
Islamic ! 31.1645 0.0000 
NewWorld | 20.4611 10.4752 
Variable F-to-remove 


12 BABYMORT 


16  GDP CAP 
19 EDUC 

21 HEALTH 
23 мп, 

34 втор 
30 ІЛҒЕЕХРМ 


32 LITERACY 


2.4381 


NewWorld 


0.3387 


Variable DEATH RT Removed 


Variable LIFEEXPF Removed 


Variable 

8 BIRTH RT 
10 DEATH RT 
40 DIFFRNCE 


Variable 

8 BIRTH RT 
10 DEATH RT 
31  LIFEEXPF 
40  DIFFRNCE 


LIII 


F-to-enter 


1.7533 
2.0024 
0.0000 


Жа . нь 


F-to-enter 


Tolerance 


0.0000 


Tolerance 


0.0860 
0.1118 
0.0192 
0.0000 
Steps 4 through 7


The input is:

STEP -


The output is:

Step 4 -- Variable LIFEEXPM Removed

Between Groups F-matrix
df :      8     45

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  35.3422    0.0000
NewWorld  |  23.3116   11.9720    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
 6 URBAN         2.4766     0.4862   |   8 BIRTH_RT      0.6807     0.1385
12 BABYMORT      0.5241     0.2498   |  10 DEATH_RT      1.3753     0.1822
16 GDP_CAP       1.7080     0.1736   |  30 LIFEEXPM      0.2770     0.1512
19 EDUC          7.3209     0.0694   |  31 LIFEEXPF      0.0382     0.0795
21 HEALTH        7.1759     0.0999   |  40 DIFFRNCE      0.0000     0.0000
23 MIL           7.0459     0.3914   |
34 B_TO_D        9.0607     0.7692   |
32 LITERACY      2.3962     0.3463   |

(We omit steps 5, 6, and 7. Each step corresponds to a STEP -.)
Steps 8, 9, and 10

The input is:

STEP EDUC HEALTH DIFFRNCE

The output is:

Step 8 -- Variable EDUC Removed

Between Groups F-matrix
df :      4     49

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  49.9302    0.0000
NewWorld  |  34.1490   20.8722    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
21 HEALTH        2.4382     0.6527   |   6 URBAN         2.3235     0.5201
23 MIL           6.6713     0.6012   |   8 BIRTH_RT      3.2442     0.2481
34 B_TO_D       16.1437     0.8875   |  10 DEATH_RT      0.3984     0.2418
32 LITERACY     33.2365     0.7619   |  12 BABYMORT      2.0856     0.3268
                                     |  16 GDP_CAP       1.1183     0.2771
                                     |  19 EDUC          5.1427     0.0836
                                     |  30 LIFEEXPM      0.8844     0.3135
                                     |  31 LIFEEXPF      2.0306     0.2500
                                     |  40 DIFFRNCE      5.1427     0.7432

Step 9 -- Variable HEALTH Removed

Between Groups F-matrix
df :      3     50

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  61.6708    0.0000
NewWorld  |  41.4085   28.1939    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
23 MIL          14.6982     0.7720   |   6 URBAN         2.5499
34 B_TO_D       27.0850     0.9148   |   8 BIRTH_RT      3.9106
32 LITERACY     52.3463     0.8057   |  10 DEATH_RT      0.4229
                                     |  12 BABYMORT      3.1122
                                     |  16 GDP_CAP       3.0247
                                     |  19 EDUC          0.3331
                                     |  21 HEALTH        2.4382
                                     |  30 LIFEEXPM      1.5845
                                     |  31 LIFEEXPF      3.3326
                                     |  40 DIFFRNCE      6.9829

Step 10 -- Variable DIFFRNCE Entered

Between Groups F-matrix
df :      4     49

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  60.8974    0.0000
NewWorld  |  38.7925   22.4751    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
23 MIL          16.6550     0.6840   |   6 URBAN         2.5012
34 B_TO_D       13.9698     0.9001   |   8 BIRTH_RT      3.8918
32 LITERACY     47.3838     0.7922   |  10 DEATH_RT      0.4126
40 DIFFRNCE      6.9829     0.7721   |  12 BABYMORT      3.2583
                                     |  16 GDP_CAP       4.2987
                                     |  19 EDUC          0.9358
                                     |  21 HEALTH        0.9358
                                     |  30 LIFEEXPM      0.9780
                                     |  31 LIFEEXPF      2.4024
Step 11 
The input is: 


STEP + 
The output is:

Step 11 -- Variable GDP_CAP Entered

Between Groups F-matrix
df :      5     48

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  57.5419    0.0000
NewWorld  |  35.7426   18.6879    0.0000

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
16 GDP_CAP       4.2987     0.3723   |   6 URBAN         2.7203     0.5135
23 MIL           5.8796     0.4785   |   8 BIRTH_RT      1.0357     0.1896
34 B_TO_D        9.4602     0.8880   |  10 DEATH_RT      1.0027     0.2159
32 LITERACY     12.3140     0.6096   |  12 BABYMORT      0.7120     0.2566
40 DIFFRNCE      8.3683     0.7352   |  19 EDUC          0.3649     0.3246
                                     |  21 HEALTH        0.3649     0.3960
                                     |  30 LIFEEXPM      0.0419     0.2599
                                     |  31 LIFEEXPF      0.2355     0.1807

Final Model

The input is:

STOP

The output is:

Stepping Summary

Variable  | F(+ent,-rem)  Wilks's Lambda  Approx. F-ratio  p-value
----------+---------------------------------------------------------
BIRTH_RT  |    -2.0115                         14.3085      0.0000
DEATH_RT  |    -2.0024        0.0486           15.2053      0.0000
LIFEEXPF  |    -0.2752        0.0492           17.1471      0.0000
LIFEEXPM  |    -0.2770        0.0498           19.5708      0.0000
BABYMORT  |    -0.5241        0.0510           22.5267      0.0000
URBAN     |    -2.6152        0.0568           25.0342      0.0000
GDP_CAP   |    -3.5833        0.0655           27.9210      0.0000
EDUC      |    -5.1427        0.0795           31.1990      0.0000
HEALTH    |    -2.4382        0.0874           39.7089      0.0000
DIFFRNCE  |     6.9829        0.0680           34.7213      0.0000
GDP_CAP   |     4.2987        0.0577           30.3710      0.0000

Classification Matrix (Cases in row categories classifi 


| Europe Islamic NewWorld ^ $correct 
---------- Шы bb (os SL ок саат 
Europe 1 19 0 ( i 
Islamic | g 1 13 35 
Н 
Ep x 20 15 20 95 


Total 


ed into columns) 
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Jackknifed Classification Matrix 


Europe Islamic NewWorld 


---—------ фине eS 
Europe Н 19 0 
Islamic | 0 12 
NewWorld | 1 3 

Total i 20 15 
Eigenvalues
6.3185   1.3688


Canonical Correlations
0.9292   0.7602


Cumulative Proportion of Total Dispersion
0.8219


Canonical Discriminant Functions : Standardized by Within Variances

          |        1         2
----------+--------------------
URBAN     |        .         .
BIRTH_RT  |        .         .
DEATH_RT  |        .         .
BABYMORT  |        .         .
GDP_CAP   |   0.6868    0.0377
EDUC      |        .         .
HEALTH    |        .         .
MIL       |   0.0676    0.8395
B_TO_D    |  -0.4461   -0.5037
LIFEEXPM  |        .         .
LIFEEXPF  |        .         .
LITERACY  |   0.3903   -0.8573
DIFFRNCE  |  -0.6378   -0.0291


Canonical Scores of Group Means

NewWorld  |  -0.7964   -1.3992
A summary of results for the models estimated by forward, backward, and interactive
stepping follows:


Model                                          % Correct   % Correct
                                                (Class)   (Jackknife)

Forward (automatic)
1. BIRTH_RT DEATH_RT MIL                           91          89

Backward (automatic)
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH               93          91

Interactive (ignoring BIRTH_RT and DEATH_RT)
3. MIL B_TO_D LITERACY                             84          84
4. MIL B_TO_D LITERACY EDUC HEALTH                 91          89
5. MIL B_TO_D LITERACY DIFFRNCE                    91          89
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP            95          87


Notice that the largest difference between the two classification methods (95% versus
87%) occurs for the last model, which includes the most variables. A difference like
this one (8%) can indicate overfitting of correlated candidate variables. Since the
jackknifed results can still be overly optimistic, cross-validation should be considered.
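
As a small worked check of the comparison above, the following Python sketch computes the gap between the ordinary and jackknifed percentages for each of the six models. The percentages are taken from the summary table; the dictionary layout and variable names are only illustrative.

```python
# (classification %, jackknifed %) for the six models in the summary table.
results = {
    "1. BIRTH_RT DEATH_RT MIL": (91, 89),
    "2. BIRTH_RT DEATH_RT MIL EDUC HEALTH": (93, 91),
    "3. MIL B_TO_D LITERACY": (84, 84),
    "4. MIL B_TO_D LITERACY EDUC HEALTH": (91, 89),
    "5. MIL B_TO_D LITERACY DIFFRNCE": (91, 89),
    "6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP": (95, 87),
}

# A large positive gap suggests the ordinary matrix is optimistic.
gaps = {model: cls - jack for model, (cls, jack) in results.items()}
widest = max(gaps, key=gaps.get)

for model, gap in gaps.items():
    print(f"{model:45s} gap = {gap}")
print("Largest gap:", widest, "->", gaps[widest], "points")
```

The gap is only a rough warning sign; it does not replace testing the rule on cases that were not used for estimation.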


Example 5 
Contrasts 


Contrasts are available with commands only. When you have specific hypotheses 
about differences among particular groups, you can specify one or more contrasts to 
direct the entry (or removal) of variables in the model. 

According to the jackknifed classification results in the stepwise examples, the 
European countries are always classified correctly (100% correct). All of the 
misclassifications are New World countries classified as Islamic or vice versa. In order 
to maximize the difference between the second (Islamic) and third groups (New 
World), we specify contrast coefficients with commands: 


CONTRAST [0 -1 1] 


If we want to specify linear and quadratic contrasts across four groups, we could
specify:

CONTRAST [-3 -1 1 3; -1 1 1 -1]

or

CONTRAST [-3 -1 1 3
-1 1 1 -1]

Here, we use the first contrast and request interactive forward stepping. 


The input is: 


DISCRIM 
USE OURWORLD 


LIFEEXPM LIFEEXPF LITERACY 


CONTRAST [0 -1 1] 
PLENGTH SHORT 

START / FORWARD 
STEP LITERACY 

STEP MIL 

STEP URBAN 
PLENGTH/CLASS JCLASS 
STOP 


After viewing the results, remember to cancel the contrast if you plan to do other
discriminant analyses:

CONTRAST / CLEAR


The output is: 


Forward Stepwise with Alpha-to-Enter = 0.150 and Alpha-to-Remove = 0.150

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
                                     |   6 URBAN        21.8716     1.0000
                                     |   8 BIRTH_RT     59.0625     1.0000
                                     |  10 DEATH_RT     28.7880     1.0000
                                     |  12 BABYMORT     44.1184     1.0000
                                     |  16 GDP_CAP      14.3163     1.0000
                                     |  19 EDUC          1.3007     1.0000
                                     |  21 HEALTH        3.3435     1.0000
                                     |  23 MIL           0.6542     1.0000
                                     |  34 B_TO_D        1.1216     1.0000
                                     |  30 LIFEEXPM     34.9999     1.0000
                                     |  31 LIFEEXPF     43.1599     1.0000
                                     |  32 LITERACY     64.8444     1.0000

(We omit results for steps 1, 2, and 3.)
Stepping Summary

Variable  | F(+ent,-rem)  Wilks's Lambda  Approx. F-ratio  p-value
----------+---------------------------------------------------------
LITERACY  |    64.8444        0.4450           64.8444      0.0000
MIL       |     9.9626        0.3723           42.9917      0.0000
URBAN     |     2.9533        0.3515           30.7433      0.0000


Classification Matrix (Cases in row categories classified into columns)

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       18         0         1        95
Islamic   |        0        14         1        93
NewWorld  |        2         3        16        76
Total     |       20        17        18        87

Jackknifed Classification Matrix

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       18         0         1        95
Islamic   |        0        14         1        93
NewWorld  |        2         3        16        76
Total     |       20        17        18        87

Eigenvalues
1.8447


Cumulative Proportion of Total Dispersion
1.0000


Canonical Scores of Group Means

Europe    |   0.8821
Islamic   |  -2.3969
NewWorld  |   0.9140

Compare the F-to-enter values with those in the forward stepping example. The
statistics here indicate that for the economic variables (GDP_CAP, EDUC, HEALTH,
and MIL), the differences between the second and third groups are much smaller than
those when European countries are included.

The Jackknifed classification matrix indicates that when LITERACY, MIL, and
URBAN are used, 87% of the countries are classified correctly. This is the same
percentage correct as in the forward stepping example for the model with BIRTH_RT,
DEATH_RT, and MIL. Here, however, one fewer Islamic country is misclassified, and
one European country is now classified incorrectly.
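
The percentages quoted from a classification matrix follow directly from its counts: the diagonal count for a group divided by that group's row total, and the sum of the diagonal divided by the number of cases. A minimal Python sketch, using the counts from the classification matrix for this example (the function name `percent_correct` is hypothetical):

```python
def percent_correct(matrix):
    """Per-group and overall percent correct from a square count matrix.

    Rows are the actual groups; columns are the assigned groups.
    """
    per_group = [round(100 * row[i] / sum(row)) for i, row in enumerate(matrix)]
    n_correct = sum(row[i] for i, row in enumerate(matrix))
    n_total = sum(sum(row) for row in matrix)
    return per_group, round(100 * n_correct / n_total)

# Rows: Europe, Islamic, NewWorld (counts from the matrix for this example).
matrix = [
    [18, 0, 1],
    [0, 14, 1],
    [2, 3, 16],
]
per_group, overall = percent_correct(matrix)
print(per_group, overall)
```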
When you look at the canonical results, you see that because a single contrast has
one degree of freedom, only one dimension is defined; that is, there is only one
eigenvalue and one canonical variable.


Example 6 
Quadratic Model 


One of the assumptions necessary for linear discriminant analysis is equality of
covariance matrices. Within-group scatterplot matrices (SPLOMs) provide a picture
of how measures co-vary. Here we add 85% ellipses of concentration to enhance our
view of the bivariate relations. Since our sample sizes do not differ markedly (15 to 21
countries per group), the ellipses for each pair of variables should have approximately
the same shape and tilt across groups if the equality of covariance assumption holds.

The input is:


DISCRIM 
USE OURWORLD 
LET(EDUC, HEALTH, MIL) = SQR(@) 
STAND 


SPLOM BIRTH_RT DEATH_RT EDUC HEALTH MIL / HALF ROW=1,
GROUP=GROUP$ ELL=.85 DENSITY=NORMAL


The output is: 


Europe Islamic NewWorld 
for the linear model. For five variables, for example, the linear and quadratic models,
respectively, for each group are:


f = a + b x_1 + c x_2 + d x_3 + e x_4 + f x_5


f = a + b x_1 + c x_2 + d x_3 + e x_4 + f x_5 + g x_1 x_2 + \dots + p x_4 x_5 + q x_1^2 + \dots + u x_5^2


So the linear model has six parameters to estimate for each group, and the quadratic 
has 21. These parameters are not all independent, so we do not require as many as 
(3 x 21) cases for a quadratic fit. 
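
The parameter counts quoted above can be checked with a little arithmetic: a linear function in p variables has a constant plus p slopes, and a quadratic function adds p squared terms and p(p - 1)/2 cross-products. A short Python sketch (function names are illustrative only):

```python
def linear_param_count(p):
    """Constant plus one coefficient per variable."""
    return 1 + p

def quadratic_param_count(p):
    """Constant, p linear terms, p squares, and p*(p-1)/2 cross-products."""
    return 1 + p + p + p * (p - 1) // 2

p = 5  # BIRTH_RT, DEATH_RT, EDUC, HEALTH, MIL
print(linear_param_count(p), quadratic_param_count(p))
```

For five variables this gives 6 and 21, matching the counts in the text; the quadratic count grows roughly with the square of the number of variables, which is why larger samples are needed.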

In this example, we fit a quadratic model using the subset of variables identified in 
the backward stepping example. Following this, we examine results for the subset 
identified in the interactive stepping example before EDUC and HEALTH are 
removed. 


The input is: 


DISCRIM 
USE OURWORLD 
LET (EDUC, HEALTH, MIL) = SQR(@) 
LABEL GROUP / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL GROUP = BIRTH_RT DEATH_RT EDUC HEALTH MIL / QUAD
PLENGTH SHORT / GCOV WCOV GCOR CFUNC CLASS JCLASS MAHAL
IDVAR COUNTRY$
ESTIMATE

MODEL GROUP = EDUC HEALTH MIL B_TO_D LITERACY / QUAD
ESTIMATE


For the first model, the output is:


Pooled Within Covariance Matrix 


df : 53 
| BIRTH RT DEATH RT EDUC HEALTH MIL 
eee NT aia ТЕГЕ >ч айла линија 
BIRTH RT H 36.2044 
DEATH RT | 10.8948 10.4790 
EDUC | -16.1749 -7.2497 42.8231 
HEALTH -12.9261 -4.9333 36.5504 35.0939 


MIL -9.6390 -7.7297 22.0789 16.9130 27.7095 
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Group Europe Covariance Matrix 


| BIRTH RT DEATH RT EDUC HEALTH MIL 
XL E sinters i Se Seaham nt DES Nu 
BIRTH RT | 1.7342 
DEATH RT | 0.0184 1.8184 
EDUC ~ Н 2.0051 1.3359 47.1696 
HEALTH 1 1.3943 -0.3625 44.2594 47.3538 
MIL H 0.8255 1.2689 15.2891 14.7387 15.7686 
Group Europe Correlation Matrix 

! BIRTH RT DEATH RT EDUC HEALTH MIL 
td Pes ii, APRI ANGE Mincir 0 Delia eui С Ric e n MS NÉ 
BIRTH RT | 1.0000 
DEATH RT | 0.0104 1.0000 
EDUC i 0.2217 0.1442 1.0000 
HEALTH Н 0.1539 -0.0391 0.9365 1.0000 
MIL Н 0.1579 0.2370 0.5606 0.5394 1.0000 


Ln( Det (COV of Group Europe) ) :   8.67105970

Group Europe Discriminant Function Coefficients

          | BIRTH_RT  DEATH_RT      EDUC    HEALTH       MIL   Constant
----------+--------------------------------------------------------------
BIRTH_RT  |  -0.3175
DEATH_RT  |  -0.0487   -0.4076
EDUC      |   0.0498    0.1140   -0.1254
HEALTH    |  -0.0408   -0.1196    0.1162   -0.1234
MIL       |   0.0104    0.0367    0.0011    0.0144   -0.0498
Constant  |   8.2708    8.6664   -3.3008    3.4017   -0.0936  -102.3559

Group Islamic Covariance Matrix 

1 BIRTH RT DEATH RT EDUC HEALTH MIL 

me LU. | Е ане Ы 

BIRTH RT | 48.6381 

DEATH RT | 27.5429 25.5429 

EDUC i -19.8729 -20.3689 33.7508 

HEALTH | -10.9262 -10.6192 18.8309 10.8603 

MIL | -15.5902 -28.4991 36.6788 19.3235 66.0183 


Group Islamic Correlation Matrix 


i BIRTH RT DEATH RT EDUC HEALTH MIL 
---------- + 
BIRTH_RT } 
DEATH RT } 
EDUC ! 1.0000 
HEALTH | -0.4754 -0.6376 0,9836 1.0000 
MIL i -0.2751 70.6940 0.7770 0.7217 1.0000 


Ln( Det (COV of Group Islamic) ) : 10.34980794 
Group Islamic Discriminant Function Coefficients 


I BIRTH RT — DEATH RT EDUC HEALTH RIL. "Gahalant 
BIRTH АТ. 170.0472 2), QUEEN NEN epee eran 
DEATH RT | 0.0703 -0,1502 
EDUC ! 0.0099  -0:0726 -0.7155 
HEALTH | -0.0424 0.1331 1.0933 -1.7901 
MIL Ка 0381 
Gonstapt..i, (J;898S., 71.3819." овезаи -0.7912 -40.8974 
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Group NewWorld Covariance Matrix 
| BIRTH_RT DEATH_RT EDUC HEALTH MIL 
——— ee і------Ш-2---2..2.12.1.1..-.........................-. 
BIRTH RT | 60.2476 
DEATH RT | 9.5738 8.1619 
EDUC ! -30.8573 -6.2226 45.0446 
HEALTH | -21.9303 -5.2955 41.6304 40.4104 
MIL | -15.4143 -1.7399 18.3092 17.2913 12.2372 
Group NewWorld Correlation Matrix 
| BIRTH RT DEATH RT EDUC HEALTH MIL 
—— mm a tie m thm itam imr de mentam 
BIRTH RT | 1.0000 
DEATH RT | 0.4317 1.0000 
EDUC 1 -0.5923 -0.3245 1.0000 
HEALTH i -0.5661 -0.2916 0.9758 1.0000 
MIL i -0.5677 -0.1741 0.7798 0.7776 1.0000 
Ln( Det (COV of Group NewWorld) ) : 11.46371023 
Group NewWorld Discriminant Function Coefficients 
| BIRTH RT DEATH RT EDUC HEALTH MIL Constant 
BIRTH RT | -0.0153 
DEATH RT | 0.0121 -0.0801 
EDUC i -0.0079 -0.0213 -0.2521 
HEALTH i 0.0040 0.0114 0.2418 -0.2663 
MIL | =0.0115 0.0196 0.0225 0.0210 -0.1160 
Constant | 0.8708 0.5285 1.6527 -1.3085 1.0459 -26.6248 
Ln( Det (Pooled Covariance Matrix) ) :  13.05914566

Test for Equality of Covariance Matrices
Chi-square :  139.5799
df         :  30
p-value    :  0.0000

Between Groups F-matrix
df :      5     49

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  64.4526    0.0000
NewWorld  |  43.1437   15.9199    0.0000
Mahalanobis Distance-Square from Group Means and 
Posterior Probabilities for Group Membership 
i .3333 
Priors : 0.3333 0.3333 0 
Group ! Europe Islamic NewWorld У 
Сазе ! Squared p-value Squared p-value Squared p-value 
| Distance pastance 
РЕ 20. eR) E 222522222 0 MOM APTA Kudai o Ен 
Group NewWorld 
Argentina | — 48.0644 0.0000 45.1886 0.0000 FEE 440029 
Barbados | 31.7731 0.0000 64.9539 0.0000 83919 ED: 
Bolivia--> { 369.2605 0.0000 4.0887 0.6484 4. 0.9742 
Brazil | 133.3136 0.0000 9.4407 0.0258 rae 0.2142 
Сапада--> Н 14.5344 0.8785 533.6186 0.0000 15.6 4210 
Chile | — 66.6003 0.0000 16.6122 0.0010 1.7558 0. 0 
Colombia | 161.1058 0.0000 9.1600 0.0427 1.8241 9.9513 
CostaRica | 181.6378 0.0000 93.1772 0.0000 +8 1-00 
Venezuela ! 180.8529 0.0000 16.5792 0.0089 6.043 .9911 
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DominicanR. ! 175.3084 0.0000 21.5274 0.0001 2.3156 0.9999 
Uruguay i 23.1473 0.0007 38.4469 0.0000 5.8099 0.9993 
Ecuador i 212.1849 0.0000 5.7723 0.1291 0.8404 0.8709 
ElSalvador i 312.8880 0.0000 10.0234 0.0313 2.0432 0.9687 
Jamaica i 73.7875 0.0000 20.1943 0.0003 2.5281 0.9997 
Guatemala i 404.9029 0.0000 5.9785 0.1727 1.7312 0.8273 
Haiti--> t 792.0701 0.0000 3.9138 0.9852 11.1944 0.0148 
Honduras i 395.9024 0.0000 16.1056 0.0042 4.0530 0.9958 
Trinidad i 164.1027 0.0000 37.9530 0.0000 5.6282 1.0000 
Peru i 167.6030 0.0000 18.8545 0.0016 4.9019 0.9984 
Panama i 133.9438 0.0000 97.6904 0.0000 3.3833 1.0000 
Cuba i 33.5803 0.0000 39.6952 0.0000 6.8234 1.0000 


+ 


--> case misclassified
* case not used in computation 


Classification Matrix (Cases in row categories classified into columns)

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       20         0         0       100
Islamic   |        0        14         1        93
NewWorld  |        1         2        18        86
Total     |       21        16        19        93


Jackknifed Classification Matrix

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       20         0         0       100
Islamic   |        0        13         2        87
NewWorld  |        1         2        18        86
Total     |       21        15        20        91


(We omit the eigenvalues, etc.)


Look at the quadratic function displayed at the beginning of this example. For our data,
the coefficients for the European group are:


a = -102.3559, b = 8.2708, c = 8.6663, d = -3.3008, e = 3.4017, f = -0.0936,
g = -0.049, ..., p = 0.014, q = -0.3175, ..., and u = -0.4977


or


f = -102.3559 + 8.2708*BIRTH_RT + ... - 0.0487*BIRTH_RT*DEATH_RT + ...
- 0.3175*BIRTH_RT² + ... - 0.4977*MIL²


The Mahalanobis distances reveal that only four cases are misclassified: Turkey as
a New World country, Canada as European, and Haiti and Bolivia as Islamic.

The classification matrix indicates that 93% of the countries are correctly classified;
using the jackknifed results, the percentage drops to 91%. The latter percentage agrees
with that for the linear model using the same variables.


Now the second model: 


The output is: 


Between Groups F-matrix
df :      5     49

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |   0.0000
Islamic   |  51.5154    0.0000
NewWorld  |  33.6025   17.9915    0.0000

Mahalanobis Distance-Square from Group Means and
Posterior Probabilities for Group Membership

Priors       |      0.3333             0.3333             0.3333
Group        |      Europe             Islamic            NewWorld
Case         |  Squared  p-value   Squared  p-value   Squared  p-value
             |  Distance           Distance           Distance
-------------+----------------------------------------------------------
Group NewWorld
Argentina    |  30.8883   0.0000   48.2685   0.0000    4.2884   1.0000
Barbados     |  35.4644   0.0000   68.7369   0.0000    7.3689   1.0000
Bolivia      | 186.1861   0.0000   10.0985   0.0814    2.1795   0.9186
Brazil       | 230.7847   0.0000    8.0834   0.1281    1.1761   0.8719
Canada-->    |  19.3760   0.7391  524.2890   0.0000   16.2638   0.2609
Chile        | 144.2802   0.0000   17.1981   0.0019    1.5605   0.9981
Colombia     | 475.0733   0.0000   29.7749   0.0000    1.9435   1.0000
CostaRica    | 834.5344   0.0000  190.5182   0.0000   10.2752   1.0000
Venezuela    | 932.5359   0.0000   83.5746   0.0000    8.7522   1.0000
DominicanR.  | 267.3835   0.0000   18.5805   0.0012    2.0442   0.9988
Uruguay      |  15.2482   0.0443   60.5296   0.0000    3.9124   0.9557
Ecuador      | 275.9703   0.0000   11.5005   0.0234    0.9626   0.9766
ElSalvador   | 497.9833   0.0000   17.6217   0.0016    1.6911   0.9984
Jamaica      | 312.0028   0.0000   15.4562   0.0029    0.7025   0.9971
Guatemala    | 501.2526   0.0000    7.8916   0.2414    2.5306   0.7586
Haiti-->     | 648.4151   0.0000    4.5825   0.9870   10.1777   0.0130
Honduras     | 688.1116   0.0000   31.8023   0.0000    4.0464   1.0000
Trinidad     | 315.4393   0.0000   43.0679   0.0000    4.6147   1.0000
Peru         | 179.9040   0.0000   16.3268   0.0163    5.0558   0.9837
Panama       | 411.0113   0.0000  109.6716   0.0000    3.6392   1.0000
Cuba         |  54.6696   0.0000   54.4701   0.0000    6.8148   1.0000


--> case misclassified
*   case not used in computation


Classification Matrix (Cases in row categories classified into columns)

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       20         0         0       100
Islamic   |        0        15         0       100
NewWorld  |        1         1        19        90
Total     |       21        16        19        96
Jackknifed Classification Matrix

          |   Europe   Islamic  NewWorld
----------+------------------------------
Europe    |       19         0         1
Islamic   |        0        14         1
NewWorld  |        1         1        19
Total     |       20        15        21

Eigenvalues
5.5852   1.3908


Canonical Correlations
0.9209


Cumulative Proportion of Total Dispersion
0.8006   1.0000


Canonical Scores of Group Means

          |        1         2
----------+--------------------
Europe    |  -2.9164    0.5014
Islamic   |   2.7255    1.3222
NewWorld  |   0.8307   -1.4219


This model does slightly better than the first one: the classification matrices here show
that 96% and 93%, respectively, are classified correctly. This is because Turkey and
Bolivia are classified correctly here and misclassified with the first model.


Example 7
Cross-Validation


sample. The first sample is often
sample. The proportion of correct classification for the te
measure for the success of the discrimination.
we should try the rules on
results with those for the original data. Since this usually
Cross-validation is easy to implement in discriminant analysis. Cases assigned a
weight of 0 are not used to estimate the discriminant functions but are classified into
groups. In this example, we generate a uniform random number (values range from 0
to 1.0) for each case, and when it is less than 0.65, the value 1.0 is stored in a new
weight variable named CASE_USE. If the random number is equal to or greater than
0.65, a 0 is placed in the weight variable. So, approximately 65% of the cases have a
weight of 1.0, and 35%, a weight of 0.
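
The weighting rule described above can be sketched in Python as follows. This is an illustration of the rule only: the function name `make_case_weights` and the case count of 56 are hypothetical, and Python's random number generator is not SYSTAT's URN, so the same seed will not pick the same cases as the SYSTAT run.

```python
import random

def make_case_weights(n_cases, cutoff=0.65, seed=12345):
    """Return a 0/1 weight per case: 1.0 when a uniform draw is below cutoff."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    return [1.0 if rng.random() < cutoff else 0.0 for _ in range(n_cases)]

case_use = make_case_weights(56)  # 56 is an illustrative case count
n_learning = int(sum(case_use))
print("learning sample:", n_learning, "test sample:", len(case_use) - n_learning)
```

Cases with weight 1.0 form the learning sample used for estimation; cases with weight 0 form the test sample that is only classified. With a small file the actual split can differ noticeably from 65/35.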

We now request a cross-validation for each of the following six models using the 
OURWORLD data: 


BIRTH_RT DEATH_RT MIL
BIRTH_RT DEATH_RT MIL EDUC HEALTH
MIL B_TO_D LITERACY
MIL B_TO_D LITERACY EDUC HEALTH
MIL B_TO_D LITERACY DIFFRNCE
MIL B_TO_D LITERACY DIFFRNCE GDP_CAP


Use interactive forward stepping to "toggle" variables іп and out of the model subsets. 


The input is: 


DISCRIM
USE OURWORLD
RSEED 12345
LET GDP_CAP = L10(GDP_CAP)
LET (EDUC, HEALTH, MIL) = SQR(@)
LET DIFFRNCE = EDUC - HEALTH
LET CASE_USE = URN() < .65
WEIGHT CASE_USE
LABEL GROUP / 1="Europe", 2="Islamic", 3="NewWorld"
MODEL GROUP = URBAN BIRTH_RT DEATH_RT BABYMORT,
GDP_CAP EDUC HEALTH MIL B_TO_D,
LIFEEXPM LIFEEXPF LITERACY DIFFRNCE
PLENGTH NONE / FSTATS CLASS JCLASS
GRAPH NONE
START / FORWARD
STEP BIRTH_RT DEATH_RT MIL
STEP EDUC HEALTH
STEP BIRTH_RT DEATH_RT EDUC HEALTH B_TO_D LITERACY
STEP EDUC HEALTH
STEP EDUC HEALTH DIFFRNCE
STEP GDP_CAP
STOP


Here are the results from the first STEP after MIL enters: 
The output is:

Variable      F-to-remove  Tolerance |  Variable      F-to-enter  Tolerance
-------------------------------------+--------------------------------------
 8 BIRTH_RT      37.099      0.562   |   6 URBAN          2.471
10 DEATH_RT      19.180      0.481   |  12 BABYMORT       0.648
23 MIL            9.581      0.794   |  16 GDP_CAP        0.520
                                     |  19 EDUC           0.049
                                     |  21 HEALTH         0.443
                                     |  34 B_TO_D         2.472
                                     |  30 LIFEEXPM       0.228
                                     |  31 LIFEEXPF       1.495
                                     |  32 LITERACY       1.041
                                     |  40 DIFFRNCE       3.591


Classification Matrix (Cases in row categories classified into columns)

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |        9         0         0       100
Islamic   |        0         8         2        80
NewWorld  |        1         2        13        81
Total     |       10        10        15        86


Classification of Cases with zero weight or frequency

          |   Europe   Islamic  NewWorld  %correct
----------+----------------------------------------
Europe    |       10         0         0       100
Islamic   |        0         5         0       100
NewWorld  |        1         0         4        80
Total     |       11         5         4        95


Jackknifed Classification Matrix 


i Europe Islamic NewWorld correct 


the learning sample. Notice that the percentages of corre 
those for the learning sample than for the test sample. 
Now we add the variables EDUC and HEALTH, with the following results: 


Variable        F-to-remove  Tolerance | Variable        F-to-enter  Tolerance 
 8 BIRTH_RT          10.789      0.451 |  6 URBAN 
10 DEATH_RT           9.334      0.449 | 12 BABYMORT 
19 EDUC               3.133      0.059 | 16 GDP_CAP 
21 HEALTH             3.596      0.067 | 34 B_TO_D 
23 MIL                5.421      0.589 | 30 LIFEEXPM 
                                       | 31 LIFEEXPF 
                                       | 32 LITERACY 
                                       | 40 DIFFRNCE 


Classification Matrix (Cases in row categories classified into columns) 

          |  Europe  Islamic  NewWorld  %correct 
Europe    |       9        0         0       100 
Islamic   |       0        9         1        90 
NewWorld  |       1        2        13        81 
Total     |      10       11        14        89 


Classification of Cases with zero weight or frequency 

          |  Europe  Islamic  NewWorld 
Europe    |      10        0         0 
Islamic   |       0        5         0 
NewWorld  |       0        0         5 
Total     |      10        5         5 


Jackknifed Classification Matrix 

          |  Europe  Islamic  NewWorld 
Europe    |       9        0         0 
Islamic   |       0        8         2 
NewWorld  |       1        2        13 
Total     |      10       10        15 


After we add EDUC and HEALTH, the results for the learning sample do not differ 
from those for the previous model. However, for the test sample, the addition of EDUC 
and HEALTH increases the percentage correct from 95% to 100%. 
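
The percentages in these panels follow directly from the counts. As a minimal sketch (the helper name percent_correct is hypothetical, not a SYSTAT function), the test-sample figures of 95% and 100% can be recomputed from the two zero-weight classification matrices:

```python
def percent_correct(matrix):
    """Return (percent correct per row, overall percent correct).

    Rows are actual groups and columns are predicted groups, so the
    diagonal holds the correctly classified cases.
    """
    row_pct = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    n_correct = sum(matrix[i][i] for i in range(len(matrix)))
    n_total = sum(sum(row) for row in matrix)
    return row_pct, 100.0 * n_correct / n_total

# Zero-weight (test sample) counts for Europe, Islamic, NewWorld.
model_1 = [[10, 0, 0], [0, 5, 0], [1, 0, 4]]
model_2 = [[10, 0, 0], [0, 5, 0], [0, 0, 5]]

rows_1, total_1 = percent_correct(model_1)
rows_2, total_2 = percent_correct(model_2)
print(rows_1, total_1)
print(rows_2, total_2)
```

Because the test sample holds only 20 cases, a single reclassified country moves the total by five percentage points, so small differences between models should be read cautiously.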


We continue by issuing the STEP specifications listed above, each time noting the 
total percentage correct as well as the percentages for the Islamic and New World 
groups. After scanning the classification results from both the test sample and the 
learning sample jackknifed panel, we conclude that model 2 (BIRTH_RT, 
DEATH_RT, MIL, EDUC, and HEALTH) is better than model 1. 


Classification of New Cases 


Group membership is known in the current example. What if you have cases where the 
group membership is unknown? For example, you might want to apply the rules 
developed for one sample to a new sample. 
When the value of the grouping variable is missing, SYSTAT still classifies the 
case. For example, we set the group code for New World countries to missing: 


IF GROUP = 3 THEN LET GROUP = . 


and request automatic forward stepping for the model containing BIRTH_RT, 
DEATH_RT, MIL, EDUC, and HEALTH. 


The input is: 


DISCRIM 
USE OURWORLD 
LET GDP_CAP = L10(GDP_CAP) 
LET (EDUC, HEALTH, MIL) = SQR(@) 
LET DIFFRNCE = EDUC - HEALTH 
IF GROUP = 3 THEN LET GROUP = . 
LABEL GROUP / 1="Europe", 2="Islamic" 
MODEL GROUP = URBAN BIRTH_RT DEATH_RT BABYMORT, 


IDVAR COUNTRY$ 
PLENGTH LONG / 
START / FORWARD 
STEP / AUTO 


countries with missing group codes and also the classification matrix. The weight 
variable is not used here. 


The output is: 

Mahalanobis distance-square from group means and 
posterior probabilities for group membership 

                      Europe              Islamic 
Priors                 0.500                0.500 

Argentina*    |    30.528   1.000      65.7 
Barbados*     |    26.359   1.000      81.913   0.000 
Bolivia*      |   152.487   0.000       5.974   1.000 
Brazil*       |   131.459   0.000      10.009   1.000 
Canada*       |             1.000     134.379   0.000 
Chile*        |    64.944   0.000      41.757   1.000 
Colombia*     |   212.831   0.000      23.184   1.000 
CostaRica*    |   316.856   0.000      61.162   1.000 
Venezuela*    |   310.289   0.000      49.194   1.000 
DominicanR.*  |   143.456   0.000      10.847   1.000 
Uruguay*      |  12131006   1.000     1012052   0.000 
Ecuador*      |   160.500   0.000       8.659   1.000 
ElSalvador*   |   188.826   0.000      12.531   1.000 
Jamaica*      |   104.674   0.000      35.674   1.000 
Guatemala*    |   160.611   0.000       7.862   1.000 
Haiti*        |   148.921   0.000       1.530   1.000 
Honduras*     |   228.580   0.000      13.378   1.000 
Trinidad*     |   132.634   0.000      29.097   1.000 
Peru*         |   107.952   0.000       8.232   1.000 
Panama*       |   164.893   0.000      21.924   1.000 
Cuba*         |    19.721   1.000      81.562   0.000 


--> case misclassified 
* case not used in computation 


Classification Matrix (Cases in row categories classified into columns) 

             |  Europe  Islamic  %correct 
Europe       |      19        0       100 
Islamic      |       0       15       100 
Total        |      19       15       100 

Not Grouped  |       5       16 


Argentina, Barbados, Canada, Uruguay, and Cuba are classified as European; the other 
16 countries are classified as Islamic. 
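
The link between the two kinds of columns in the listing can be sketched as follows. This is a hedged illustration, not SYSTAT's code: it assumes the distances shown are squared Mahalanobis distances and that the groups are multivariate normal with a common covariance matrix, so that each posterior probability is proportional to the prior times exp(-D²/2). The function name posteriors is hypothetical.

```python
import math

def posteriors(d2, priors):
    """Posterior group probabilities from squared distances to group means."""
    # Work with logs and subtract the maximum so that exp() cannot underflow
    # to zero for every group at once.
    logs = [math.log(p) - 0.5 * d for d, p in zip(d2, priors)]
    top = max(logs)
    terms = [math.exp(v - top) for v in logs]
    total = sum(terms)
    return [t / total for t in terms]

groups = ["Europe", "Islamic"]
# (distance to Europe, distance to Islamic), copied from the listing above.
cases = {"Cuba": (19.721, 81.562),
         "Haiti": (148.921, 1.530),
         "Peru": (107.952, 8.232)}

assigned = {}
for name, d2 in cases.items():
    post = posteriors(d2, priors=(0.5, 0.5))
    assigned[name] = groups[post.index(max(post))]
    print(name, [round(p, 3) for p in post], assigned[name])
```

With equal priors, each case simply goes to the group whose mean is nearest, which is why Cuba is placed with Europe and Haiti and Peru with the Islamic group.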


Example 8 
Robust Discriminant Analysis 


This example illustrates the use of robust discriminant analysis with the data selected 
from Lubischew (1962) that have been analyzed by Flury (1997). The data set consists 
of two species of flea beetles with measurements on four variables. The variables are: 


X1 = distance of the transverse groove to the posterior border of the prothorax (in 
microns) 

X2 = length of the elytra (in mm) 

X3 = length of the second antennal joint (in microns) 

X4 = length of the third antennal joint (in microns) 


We have inserted some outliers in the first group of the data. The two groups in the data 
are Haltica oleracea and H. carduorum. 


The input is: 


RDISCRIM 
USE FLEA 
MODEL SPECIES = X1 X2 X3 X4 
PLENGTH SHORT 
ESTIMATE 
The output is: 

Group Frequencies 

        1       2 
       19      20 

Updated Means 

    |         1           2 
X1  |  196.9339    176.5172 
X2  |  266.5029    290.5862 
X3  |  138.7356    157.0690 
X4  |  183.0287    210.8621 

Pooled Reweighted Covariance Matrix 

    |        X1          X2          X3          X4 
X1  |   92.8449 
X2  |  135.8348    384.5340 
X3  |   39.4631    138.0296    129.9713 
X4  |   89.9281    127.0093     46.5813    196.7184 

Discriminant Functions 

          |          1            2 
Constant  |  -238.6844    -205.3111 
X1        |     2.6207       1.5236 
X2        |    -0.4833      -0.1768 
X3        |     0.8404       0.8285 
X4        |    -0.1546       0.2934 

Classification Matrix (Cases in row categories classified into columns) 

       |    1     2   %correct 
1      |   16     3         84 
2      |    3    17         85 
Total  |   19    20         85 
12 
Factor Analysis 


Herb Stenson and Leland Wilkinson 


FACTOR provides principal components analysis and common factor analysis 
(maximum likelihood and iterated principal axis). SYSTAT has options to rotate, sort, 
plot, and save factor loadings. With the principal components method, you can also 
save the scores and coefficients. Orthogonal methods of rotation include varimax, 
equamax, quartimax, and orthomax. A direct oblimin method is also available for 
oblique rotation. Users can explore other rotations by interactively rotating a 3-D 
Quick Graph plot of the factor loadings. Various inferential statistics (for example, 
confidence intervals, standard errors, and chi-square tests) are provided, depending on 
the nature of the analysis that is run. 

Resampling procedures are available in this feature. 


Statistical Background 


Principal components (PCA) and common factor (MLA for maximum likelihood and 
IPA for iterated principal axis) analyses are methods of decomposing a correlation or 
covariance matrix. Although principal components and common factor analyses are 
based on different mathematical models, they can be used on the same data and both 
usually produce similar results. Factor analysis is often used in exploratory data 
analysis to: 
■ Study the correlations of a large number of variables by grouping the variables in 
“factors” so that variables within each factor are more highly correlated with 
variables in that factor than with variables in other factors. 


■ Interpret each factor according to the meaning of the variables. 


■ Summarize many variables by a few factors. The scores from the factors can be 
used as input data for t tests, regression, ANOVA, discriminant analysis, and so on. 


Often the users of factor analysis are overwhelmed by the gap between theory and 
practice. In this chapter, we try to offer practical hints. It is important to realize that you 
may need to make several passes through the procedure, changing options each time, 
until the results give you the necessary information for your problem. 

If you understand the component model, you are on the way toward understanding 
the factor model, so let us begin with the former. 


A Principal Component 


What is a principal component? The simplest way to see is through real data. The 
following data consist of Graduate Record Examination verbal and quantitative scores. 
These scores are from 25 applicants to a graduate Psychology department. 


VERBAL   QUANTITATIVE 

   590            530 
   620            620 
   640            620 
   650            550 
   620            610 
   610            660 
   560            570 
   610            730 
   600            650 
   740            790 
   560            580 
   680            710 
   600            540 
   520            530 
   660            650 
   750            710 
   630            640 
   570            660 
   600            650 
   570            570 
   600            550 
   690            540 
   770            670 
   610            660 
   600            640 


Now, we could decide to try linear regression to predict verbal scores from 
quantitative. Or, we could decide to predict quantitative from verbal by the same 
method. The data do not suggest which is a dependent variable; either will do. What 
if we are not interested in predicting either one separately but instead want to know 
how both variables hang together jointly? This is what a principal component does. 
Karl Pearson, who developed principal component analysis in 1901, described a 
component as a “line of closest fit to systems of points in space.” In short, the 
regression line indicates best prediction, and the component line indicates best 
association. 

The following figure shows the regression and component lines for our GRE data. 
The regression of y on x is the line with the smallest slope (flatter than diagonal). The 
regression of x on y is the line with the largest slope (steeper than diagonal). The 
component line is between the other two. Interestingly, when most people are asked to 
draw a line relating two variables in a scatterplot, they tend to approximate the 
component line. It takes a lot of explaining to get them to realize that this is not the best 
line for predicting the vertical axis variable (y) or the horizontal axis variable (x). 
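
The ordering of the three lines can be checked numerically from the 25 GRE pairs listed above. The sketch below uses only the Python standard library; the variable names are ours, and the closed-form expression for the component slope is the standard major-axis formula for two variables rather than anything taken from SYSTAT.

```python
import math

verbal = [590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
          520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600]
quant = [530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
         530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640]

def cov(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

sxx, syy, sxy = cov(verbal, verbal), cov(quant, quant), cov(verbal, quant)

# Regression of y (quantitative) on x (verbal): the flattest line.
slope_y_on_x = sxy / sxx
# Regression of x on y, redrawn with y on the vertical axis: the steepest line.
slope_x_on_y = syy / sxy
# First principal component (major axis) of the covariance matrix.
diff = syy - sxx
slope_component = (diff + math.sqrt(diff * diff + 4 * sxy * sxy)) / (2 * sxy)

print(round(slope_y_on_x, 3), round(slope_component, 3), round(slope_x_on_y, 3))
```

The component slope falls between the two regression slopes, as the figure shows.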


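[Figure: scatterplot of Quantitative GRE Score (vertical axis, 500 to 800) against 
Verbal GRE Score (horizontal axis, 500 to 800), showing the two regression lines and 
the component line.] 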
Notice that the slope of the component line is approximately 1, which means that the 
two variables are weighted almost equally (assuming the axis scales are the same). We 
could make a new variable called GRE that is the sum of the two tests: 


GRE = VERBAL + QUANTITATIVE 


This new variable could summarize, albeit crudely, the information in the other two. If 
the points clustered almost perfectly around the component line, then the new 
component variable could summarize almost perfectly both variables. 


Multiple Principal Components 


The goal of principal components analysis is to summarize a multivariate data set as 


component. It is possible, however, to draw a second component perpendicular to the 
first. The first component will summarize as much of the joint variation as possible. 
The second will summarize what is left. If we do this with the GRE data, of course, we 


Component Coefficients 


In the above equation for computing the first principal component on our test data, we 
made both coefficients equal. In fact, when you run the sample covariance matrix using 
factor analysis in SYSTAT, the coefficients are as follows: 


GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE 


They are indeed nearly equal. Their magnitude is considerably less than 1 because 
principal components are usually scaled to conserve variance. That is, once you 
compute the components with these coefficients, the total variance on the components 
is the same as the total variance on the original variables. 


Component Loadings 


Most researchers want to know the relation between the original variables and the 
components. Some components may be nearly identical to an original variable; in other 
words, their coefficients may be nearly 0 for all variables except one. Other 
components may be a more even amalgam of several original variables. 

Component loadings are the covariances of the original variables with the 
components. In our example, these loadings are 51.085 for VERBAL and 62.880 for 
QUANTITATIVE. You may have noticed that these are proportional to the coefficients; 
they are simply scaled differently. If you square each of these loadings and add them 
up separately for each component, you will have the variance accounted for by each 
component. 
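
The loadings and coefficients quoted for the GRE data can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that the text does not state outright: the loadings are taken as the unit eigenvector multiplied by the square root of the eigenvalue, and the coefficients as the same eigenvector divided by that square root. The closed-form eigenvalue applies only to a 2 x 2 covariance matrix.

```python
import math

verbal = [590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
          520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600]
quant = [530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
         530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640]

def cov(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

sxx, syy, sxy = cov(verbal, verbal), cov(quant, quant), cov(verbal, quant)

# Largest eigenvalue of [[sxx, sxy], [sxy, syy]] in closed form.
eigenvalue = (sxx + syy) / 2 + math.sqrt(((syy - sxx) / 2) ** 2 + sxy ** 2)
# Matching eigenvector, from (sxx - eigenvalue) * vx + sxy * vy = 0.
vx, vy = sxy, eigenvalue - sxx
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

scale = math.sqrt(eigenvalue)
loadings = (vx * scale, vy * scale)        # covariances with the component
coefficients = (vx / scale, vy / scale)    # weights applied to the raw scores
share = eigenvalue / (sxx + syy)           # variance accounted for

print([round(v, 3) for v in loadings])
print([round(v, 4) for v in coefficients])
print(round(100 * share, 1))
```

The squared loadings sum to the eigenvalue, which is the variance accounted for by the first component.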


Correlations or Covariances 


Most researchers prefer to analyze the correlation rather than covariance structure 
among their variables. Sample correlations are simply covariances of sample 
standardized variables. Thus, if your variables are measured on very different scales or 
if you feel the standard deviations of your variables are not theoretically significant, 
you will want to work with correlations instead of covariances. In our test example, 
working with correlations yields loadings of 0.879 for each variable instead of 51.085 
and 62.880. When you factor the correlation instead of the covariance matrix, then the 
loadings are the correlations of each component with each original variable. 

For our test data, loadings of 0.879 mean that if you created a GRE component by 
standardizing VERBAL and QUANTITATIVE and adding them together weighted by 
the coefficients, you would find the correlation between these component scores and 
the original VERBAL scores to be 0.879. The same would be true for QUANTITATIVE. 
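
The value 0.879 can be verified directly. For two standardized variables with correlation r, the first component weights them equally, and its correlation with either variable is the square root of (1 + r)/2. The sketch below computes that quantity and also checks it the long way, by building the component scores and correlating them with VERBAL; the helper names are ours.

```python
import math

verbal = [590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
          520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600]
quant = [530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
         530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640]

def standardize(values):
    """Convert values to z scores using the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)

r = corr(verbal, quant)
loading_formula = math.sqrt((1 + r) / 2)

# Equal weights on the standardized variables give the first component.
scores = [zv + zq for zv, zq in zip(standardize(verbal), standardize(quant))]
loading_direct = corr(scores, verbal)

print(round(r, 3), round(loading_formula, 3), round(loading_direct, 3))
```

Both routes give the same loading, and by symmetry the same value holds for QUANTITATIVE.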


Signs of Component Loadings 


The signs of loadings within components are arbitrary. If a component (or factor) has 
more negative than positive loadings, you may change minus signs to plus and plus to 
minus. SYSTAT does this automatically for components that have more negative than 
positive loadings, and thus will occasionally produce components or factors that have 
different signs from those in other computer programs. This occasionally confuses 
users. In mathematical terms, Ax = λx and -Ax = -λx are equivalent. 
We have seen how principal components analysis is a method for computing new 
variables that summarize variation in a space parsimoniously. For our test variables, 
the equation for computing the first component was: 


GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE 


This component equation is linear, of the form: 

Component = Linear combination of {Observed variables} 

Factor analysts turn this equation around: 

Observed variable = Linear combination of {Factors} + Error 


This model was presented by Spearman near the turn of the century in the context of a 
single intelligence factor and extended to multiple mental measurement factors by 


model, but rather its quadratic form: 
Observed covariances = Factor covariances + Error covariances 


The covariances in this equation are usually expressed in matrix form, so that the 
model decomposes an observed covariance matrix into a hypothetical factor 


specificities, 


In ordinary language, then, the factor model expresses variation within and relations 
among observed variables as partly common variation among factors and partly 
specific variation among random errors. 
Estimating Factors 


Factor analysis involves several steps: 


■ First, the correlation or covariance matrix is computed from the usual 
cases-by-variables data file or it is input as a matrix. 


■ Second, the factor loadings are estimated. This is called initial factor extraction. 
Extraction methods are described in this section. 


■ Third, the factors are rotated to make the loadings more interpretable; that is, 
rotation methods make the loadings for each factor either large or small, not 
in-between. These methods are described in the next section. 


Factors must be estimated iteratively in a computer. There are several methods 
available. The most popular approach, available in SYSTAT, is to modify the diagonal 
of the observed covariance matrix and calculate factors the same way components are 
computed. This procedure is repeated until the communalities reproduced by the factor 
covariances are indistinguishable from the diagonal of the modified matrix. 
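
The iteration just described can be sketched for the simplest case, a single factor. The code below is an illustration under stated assumptions, not SYSTAT's algorithm: it starts from unit communalities (the principal components solution), finds the leading eigenvector by power iteration, and stops after 25 iterations or when no communality changes by more than 0.001. The function names are hypothetical, and the correlations are those among height, arm span, forearm, and lower leg from Example 1 later in this chapter.

```python
import math

def top_eigenpair(m, sweeps=500):
    """Dominant eigenvalue and unit eigenvector of m by power iteration."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(sweeps):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in w))
        v = [x / value for x in w]
    return value, v

def principal_axis_one_factor(r, max_iter=25, conv=0.001):
    """Iterate communalities on the diagonal until they stop changing."""
    n = len(r)
    communality = [1.0] * n  # unit diagonal: the principal components start
    for iteration in range(1, max_iter + 1):
        reduced = [row[:] for row in r]
        for i in range(n):
            reduced[i][i] = communality[i]  # modify the diagonal
        value, vector = top_eigenpair(reduced)
        loadings = [math.sqrt(value) * x for x in vector]
        updated = [a * a for a in loadings]  # communalities reproduced
        change = max(abs(u - c) for u, c in zip(updated, communality))
        communality = updated
        if change < conv:
            break
    return loadings, communality, iteration

r = [[1.000, 0.846, 0.805, 0.859],
     [0.846, 1.000, 0.881, 0.826],
     [0.805, 0.881, 1.000, 0.801],
     [0.859, 0.826, 0.801, 1.000]]

loadings, communality, iteration = principal_axis_one_factor(r)
print([round(a, 3) for a in loadings], iteration)
```

With several factors the same loop applies, except that the leading k eigenvectors are kept and each communality is the sum of the squared loadings across the k factors.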


Rotation 


Usually the initial factor extraction does not give interpretable factors. One of the 
purposes of rotation is to obtain factors that can be named and interpreted. That is, if 
you can make the large loadings larger than before and the smaller loadings smaller, 
then each variable is associated with a minimal number of factors. Hopefully, the 
variables that load strongly together on a particular factor will have a clear meaning 
with respect to the subject area at hand. 

It helps to study plots of loadings for one factor against those for another. Ideally, 
you want to see clusters of loadings at extreme values for each factor: like what A and 
C are for factor 1, and B and D are for factor 2 in the left plot, and not like E and F in 
the middle plot. 
o the rotation automatically. There are many criteria 


among component loadings, although Thurstone’s are 
most widely cited. For p variables and m components: 


■ Each component should have at least m near-zero loadings. 
■ Few components should have nonzero loadings on the same variable. 


SYSTAT provides five methods of rotating loadings: varimax, equamax, quartimax, 
orthomax, and oblimin. 


Principal Components versus Factor Analysis 
but these are not estimates in the usual statistical sense. This problem arises not 
because factors can be arbitrarily rotated (so can principal components), but because 
the common factor model is based on more unobserved parameters than observed data 
points, an unusual circumstance in statistics. 

In recent years, “maximum likelihood” factor analysis algorithms have been 
devised to estimate common factors. The implementation of these algorithms in 
popular computer packages has led some users to believe that the factor indeterminacy 
problem does not exist for “maximum likelihood” factor estimates. It does. 

Mathematicians and psychometricians have known about the factor indeterminacy 
problem for decades. For a historical review of the issues, see Steiger (1979); for a 
general review, see Rozeboom (1982). For further information, refer to Harman (1976), 
Mulaik (1972), Gnanadesikan (1977), Mardia, Kent, and Bibby (1979), Afifi, May, 
and Clark (2004), Clarkson and Jennrich (1988), or Dixon (1992). 

Because of the indeterminacy problem, SYSTAT computes subjects’ scores only 
for the principal components model where subjects’ scores are a simple linear 
transformation of scores on the factored variables. SYSTAT does not save scores from 
a common factor model. 


Applications and Caveats 


While there is not room here to discuss more statistical issues, you should realize that 
there are several myths about factors versus components: 

Myth. The factor model allows hypothesis testing; the component model does not. 
Fact. Morrison (2004) and others present a full range of formal statistical tests for 
components. 

Myth. Factor loadings are real; principal component loadings are approximations. 
Fact. This statement is too ambiguous to have any meaning. It is easy to define things 
so that factors are approximations of components. 


Myth. Factor analysis is more likely to uncover lawful structure in your data; principal 
components are more contaminated by error. 

Fact. Again, this statement is ambiguous. With further definition, it can be shown to be 
true for some data, false for others. It is true that, in general, factor solutions will have 
lower dimensionality than corresponding component solutions. This can be an 
advantage when searching for simple structure among many variables, as long as you 
compare the result to a principal components solution to avoid being fooled by the sort 
of degeneracies illustrated above. 
Factor Analysis in SYSTAT 


Factor Analysis Dialog Box 


For factor analysis, from the menus choose: 


Analyze 
Factor Analysis... 


[Figure: Analyze: Factor Analysis dialog box] 


The following options are available: 


Model variables. Variables used to create factors. 
Method. SYSTAT offers three estimation methods: 

■ Principal components analysis (PCA) is the default method of analysis. 

■ Iterated principal axis (IPA) provides an iterative method to extract common 
factors by starting with the principal components solution and iteratively solving 
for communalities. 

■ Maximum likelihood analysis (MLA) iteratively finds communalities and common 
factors. 

Display. You can sort factor loadings by size or display extended results. Selecting 
Extended results displays all possible factor output. 

Sample size for matrix input. If your data are in the form of a correlation or covariance 
matrix, you must specify the sample size on which the input matrix is based so that 
inferential statistics (available with extended results) can be computed. 

Matrix for extraction. You can factor a correlation matrix or a covariance matrix. Most 
frequently, the correlation matrix is used. You can also delete missing cases pairwise 
instead of listwise. Listwise deletes any case with missing data for any variable in the 
list. Pairwise examines each pair of variables and uses all cases with both values 
present. 

Extraction parameters. You can limit the results by specifying extraction parameters. 

■ Minimum eigenvalue. Specify the smallest eigenvalue to retain. The default is 1.0 
for PCA and IPA (not available with maximum likelihood). Incidentally, if you 
specify 0, factor analysis ignores components with negative eigenvalues (which 
can occur with pairwise deletion). 

■ Number of factors. Specify the number of factors to compute. If you specify both 
the number of factors and the minimum eigenvalue, factor analysis uses whichever 
criterion results in the smaller number of components. 

■ Iterations. Specify the number of iterations SYSTAT should perform (not 
available for principal components). The default is 25. 

■ Convergence. Specify the convergence criterion (not available for principal 
components). The default is 0.001. 
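
How the two retention criteria combine can be sketched in a few lines. The function below is hypothetical and only illustrates the rule stated above; in particular, it assumes that an eigenvalue must be strictly greater than the minimum to be retained, which is a guess about the boundary case. The eigenvalues are those reported for the YOUTH correlation matrix in Example 1.

```python
def n_retained(eigenvalues, min_eigen=1.0, number=None):
    """Number of components kept under both extraction parameters."""
    by_eigen = sum(1 for e in eigenvalues if e > min_eigen)
    if number is None:
        return by_eigen
    # When both are given, the criterion yielding fewer components wins.
    return min(by_eigen, number)

youth = [4.6729, 1.7710, 0.4810, 0.4214, 0.2332, 0.1867, 0.1373, 0.0965]

print(n_retained(youth))                           # default minimum of 1.0
print(n_retained(youth, number=1))                 # number is the tighter limit
print(n_retained(youth, min_eigen=0.4, number=5))  # eigenvalue is the tighter limit
```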
Rotation 


This tab specifies the factor rotation method. 


[Figure: Analyze: Factor Analysis dialog box, Rotation tab] 


The following methods are available: 


■ No rotation. Factors are not rotated. 


■ Varimax. An orthogonal rotation method that minimizes the number of variables 
that have high loadings on each factor. It simplifies the interpretation of the factors. 


■ Equamax. The number of variables that load highly on a factor and the number of 
factors needed to explain a variable are minimized. 


■ Quartimax. A rotation method that minimizes the number of factors needed to 
explain each variable. It simplifies the interpretation of the observed variables. 


■ Orthomax. Specifies families of orthogonal rotations. Gamma specifies the 
member of the family to use. Varying Gamma changes maximization of the 
variances of the loadings from columns (Varimax) to rows (Quartimax). 


■ Oblimin. Specifies families of oblique (non-orthogonal) rotations. Gamma 
specifies the member of the family to use. For Gamma, specify 0 for moderate 
correlations, positive values to allow higher correlations, and negative values to 
restrict correlations. 
Save 


You can save factor analysis results for further analyses. 


[Figure: Analyze: Factor Analysis dialog box, Save tab] 


For the maximum likelihood and iterated principal axis methods, you can save only 
loadings. For the principal components method, select from these options: 

■ Do not save results. Results are not saved. 

■ Factor scores. Standardized factor scores. 

■ Residuals. Residuals for each case. For a correlation matrix, the residuals are the 
actual z score minus the predicted z score, using the factor scores times the loadings 
to get the predicted scores. For a covariance matrix, the residuals are from 
unstandardized predictions. With an orthogonal rotation, Q and PROB are also 
saved. Q is the sum of the squared residuals, and PROB is its probability. 

■ Principal components. Unstandardized principal components scores with mean 0 
and variance equal to the eigenvalue for the factor (only for PCA without rotation). 

■ Factor coefficients. Coefficients that produce standardized scores. For a correlation 
matrix, multiply the coefficients by the standardized variables; for a covariance 
matrix, use the original variables. 

■ Eigenvectors. Eigenvectors (only for PCA without a rotation). Use to produce 
unstandardized scores. 

■ Factor loadings. Factor loadings. 

■ Save data with scores. Saves the selected item and all the variables in the working 
data file as a new data file. Use with options for scores (not loadings, coefficients, 
or other similar options). 


If you save scores, the variables in the file are labeled FACTOR(1), FACTOR(2), and 
so on. Any observations with missing values on any of the input variables will have 
missing values for all scores. The scores are normalized to have zero mean and, if the 
correlation matrix is used, unit variance. If you use the covariance matrix and perform 
no rotations, SYSTAT does not standardize the component scores. The sum of their 
variances is the same as for the original data. 

If you want to use the score coefficients to get component scores for new data, 
multiply the coefficients by the standardized data. SYSTAT does this when it saves 
scores. Another way to do cross-validation is to assign a zero weight to those cases not 
used in the factoring and to assign a unit weight to those cases used. The zero-weight 
cases are not used in the factoring, but scores are computed for them. 

When Factor scores or Principal components is requested, T² and PROB are also saved. 
The former is the Hotelling T² statistic that squares the standardized distance from each 
case to the centroid of the factor space (that is, the sum of the squared, standardized 
factor scores). PROB is the upper-tail probability of T². Use this statistic to identify 
outliers within the factor space. T² is not computed with an oblique rotation. 
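
As a minimal sketch of the statistic described above, T² for a case is the sum of its squared, standardized factor scores. The scores below are invented for illustration, and the function name is hypothetical. PROB is not computed here because the text does not say which reference distribution SYSTAT uses for it.

```python
def hotelling_t2(standardized_scores):
    """Sum of squared standardized factor scores for one case."""
    return sum(s * s for s in standardized_scores)

# Hypothetical standardized scores on two factors for three cases.
cases = {"case_a": (0.3, -0.5),
         "case_b": (2.5, 2.0),
         "case_c": (-1.0, 0.2)}

t2 = {name: hotelling_t2(scores) for name, scores in cases.items()}
most_extreme = max(t2, key=t2.get)

for name, value in t2.items():
    print(name, round(value, 2))
print("largest T2:", most_extreme)
```

A case far from the centroid on several factors at once accumulates a large T², even when none of its individual scores looks extreme.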
Using Commands 


After selecting a data file with USE filename, continue with: 


FACTOR 
MODEL varlist 
SAVE filename / SCORES DATA LOAD COEF VECTORS PC RESID 
ESTIMATE / METHOD = PCA or IPA or MLA , 
           LISTWISE or PAIRWISE N=n CORR or COVA , 
           NUMBER=n EIGEN=n ITER=n CONV=n SORT , 
           ROTATE = VARIMAX or EQUAMAX or QUARTIMAX 
                    or ORTHOMAX or OBLIMIN 
           GAMMA=n SAMPLE = BOOT(m,n) JACK SIMPLE(m,n) 


Usage Considerations 


Types of data. Data for factor analysis can be a cases-by-variables data file, a 
correlation matrix, or a covariance matrix. 


Print options. Factor analysis offers three categories of output: Short (the default), 
Medium, and Long. Each has specific output panels associated with it. 
For Short, the default, panels are: Latent roots or eigenvalues (not MLA), initial 


rotated loadings (PCA) or pattern (MLA, IPA) matrix, variance explained by rotated 
components, percentage of total variance explained, an 
components or factors (oblimin only). 


asymptotic 95% confidence limits for the eigenvalues a 
eigenvalues with standard errors, 

With Long, you get the panels listed for Short and Medium, plus: latent vectors 
(eigenvectors) with standard errors (not MLA), the chi-square test that the number 
of factors is k (MLA only), and factor coefficients. With an oblimin rotation: direct and 
indirect contribution of factors to variances and the rotated structure matrix. 


Quick Graphs. Factor analysis produces a scree plot and a factor loadings plot. 
Saving files. You can save factor scores, residuals, principal components, factor 
coefficients, eigenvectors, or factor loadings as a new data file. For the iterated 
principal axis and maximum likelihood methods, you can save only factor loadings. 
You can save only eigenvectors and principal components for unrotated solutions using 
the principal components method. 


BY groups. Factor analysis produces separate analyses for each level of any BY 
variables. 

Case frequencies. Factor analysis uses FREQUENCY variables to duplicate cases for 
rectangular data files. 


Case weights. For rectangular data, you can weight cases using a WEIGHT variable. 


Examples 


Example 1 
Principal Components 


Principal components (PCA, the default method) is a good way to begin a factor 
analysis (and possibly the only method you may need). If one variable is a linear 
combination of the others, the program will not stop (MLA and IPA both require a 
nonsingular correlation or covariance matrix). The PCA output can also provide 
indications that: 

■ One or more variables have little relation to the others and, therefore, are not suited 
for factor analysis, so in your next run, you might consider omitting them. 

■ The final number of factors may be three or four and not double or triple this 
number. 


To illustrate this method of factor extraction, we borrow data from Harman (1976), 
who borrowed them from a 1937 unpublished thesis by Mullen. This classic data set is 
widely used in the literature. For example, Jackson (2003) reports loadings for the 
PCA, MLA, and IPA methods. The data are measurements recorded for 305 youth aged 
seven to seventeen: height, arm span, length of forearm, length of lower leg, weight, 
bitrochanteric diameter (the upper thigh), girth, and width. Because the units of these 
measurements differ, we analyze a correlation matrix: 


          Height  Arm_Span  Forearm  Lowerleg  Weight  Bitro  Girth  Width 
Height     1.000 
Arm_Span   0.846     1.000 
Forearm    0.805     0.881    1.000 
Lowerleg   0.859     0.826    0.801     1.000 
Weight     0.473     0.376    0.380     0.436   1.000 
Bitro      0.398     0.326    0.319     0.329   0.762  1.000 
Girth      0.301     0.277    0.237     0.327   0.730  0.583  1.000 
Width      0.382     0.415    0.345     0.365   0.629  0.577  0.539  1.000 


The correlation matrix is stored in the YOUTH file. SYSTAT knows that the file 
contains a correlation matrix, so no special instructions are needed to read the matrix. 


The input is: 


FACTOR 
USE YOUTH 
MODEL HEIGHT .. WIDTH 
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX 


Latent Roots (Eigenvalues) 

        1        2        3        4        5 
   4.6729   1.7710   0.4810   0.4214   0.2332 

        6        7        8 
   0.1867   0.1373   0.0965 

Component Loadings 

          |        1         2 
HEIGHT    |   0.8594    0.3723 
ARM_SPAN  |   0.8416    0.4410 
LOWERLEG  |   0.8396    0.3953 
FOREARM   |   0.8131    0.4586 
WEIGHT    |   0.7580   -0.5247 
BITRO     |   0.6742   -0.5333 
WIDTH     |   0.6706   -0.4185 
GIRTH     |   0.6172   -0.5801 

Variance Explained by Components 

        1        2 
   4.6729   1.7710 

Percent of Total Variance Explained 

        1        2 
  58.4110  22.1373 

Rotated Loading Matrix (VARIMAX, Gamma = 1.000000) 

          |        1         2 
ARM_SPAN  |   0.9298    0.1955 
FOREARM   |   0.9191    0.1638 
HEIGHT    |   0.8998    0.2599 
LOWERLEG  |   0.8992    0.2295 
WEIGHT    |   0.2507    0.8871 
BITRO     |   0.1806    0.8404 
GIRTH     |   0.1068    0.8403 
WIDTH     |   0.2509    0.7496 

"Variance" Explained by Rotated Components 

        1        2 
   3.4973   2.9465 

Percent of Total Variance Explained 

        1        2 
  43.7165  36.8318 

[Figures: Scree Plot (horizontal axis: Number of Factors) and Factor Loadings Plot 
(horizontal axis: Factor(1))] 


Notice that we did not specify how many factors we wanted. For PCA, the default
is to compute as many factors as there are eigenvalues greater than 1.0, so in this run
you study results for two factors. After examining the output, you may want to specify
a minimum eigenvalue or, very rarely, a lower limit.

Unrotated loadings (and orthogonally rotated loadings) are correlations of the
variables with the principal components (factors). They are also the eigenvectors of the
correlation matrix multiplied by the square roots of the corresponding eigenvalues.
Usually these loadings are not useful for interpreting the factors. For some industrial
applications, researchers prefer to examine the eigenvectors alone.
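
The relation between eigenvalues, eigenvectors, and loadings can be checked by hand.
The following is a minimal Python sketch, not SYSTAT's algorithm: it applies plain
power iteration with deflation (an assumption chosen for brevity) to the YOUTH
correlation matrix printed earlier, then scales each unit eigenvector by the square
root of its eigenvalue. The helper names (matvec, power_iteration, deflate) are
illustrative only.

```python
import math

NAMES = ["HEIGHT", "ARM_SPAN", "FOREARM", "LOWERLEG",
         "WEIGHT", "BITRO", "GIRTH", "WIDTH"]

# Lower triangle of the YOUTH correlation matrix, copied from the text.
LOWER = [
    [1.000],
    [0.846, 1.000],
    [0.805, 0.881, 1.000],
    [0.859, 0.826, 0.801, 1.000],
    [0.473, 0.376, 0.380, 0.436, 1.000],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
]
P = len(LOWER)
# Expand the triangle into a full symmetric matrix.
R = [[LOWER[max(i, j)][min(i, j)] for j in range(P)] for i in range(P)]


def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def power_iteration(m, iterations=500):
    """Return (eigenvalue, unit eigenvector) of the dominant eigenpair."""
    v = [1.0 + 0.1 * i for i in range(len(m))]  # arbitrary start vector
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
    if v[0] < 0:  # the sign of an eigenvector is arbitrary; make HEIGHT positive
        v = [-x for x in v]
    return eigenvalue, v


def deflate(m, eigenvalue, v):
    """Remove an extracted eigenpair so the next one becomes dominant."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)]


eigen1, vec1 = power_iteration(R)
eigen2, vec2 = power_iteration(deflate(R, eigen1, vec1))

# Loadings are unit eigenvectors scaled by the square root of the eigenvalue.
loadings = [[vec1[i] * math.sqrt(eigen1), vec2[i] * math.sqrt(eigen2)]
            for i in range(P)]
# Each variable contributes 1 to the total variance of a correlation matrix.
percent = [100 * eigen1 / P, 100 * eigen2 / P]

for name, row in zip(NAMES, loadings):
    print(f"{name:<9}{row[0]:9.4f}{row[1]:9.4f}")
print(f"eigenvalues {eigen1:.4f} {eigen2:.4f}")
print(f"percent     {percent[0]:.2f} {percent[1]:.2f}")
```

Because each column of loadings is a unit vector times the square root of an
eigenvalue, the sum of squared loadings in a column equals that eigenvalue.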

The Variance explained for each component is the eigenvalue for the factor. The 
first factor accounts for 58.4% of the variance; the second, 22.1%. The Total Variance 
is the sum of the diagonal elements of the correlation (or covariance) matrix. By 
summing the Percent of Total Variance Explained for the two factors
(58.411 + 22.137 = 80.548), you can say that more than 80% of the variance of all


method is used. 
To interpret each factor, look for variables with high loadings. The four variables
that load highly on factor 1 can be said to measure "lankiness," while the four that load
highly on factor 2 measure "stockiness." Other data sets may include variables that do
not load highly on any specific factor.
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Maximum Likelihood 




This example uses maximum likelihood for initial factor extraction and 2 as the 
number of factors. Other options remain as in the principal components example. 


The input is:

FACTOR
USE YOUTH
MODEL HEIGHT..WIDTH
ESTIMATE / METHOD=MLA N=305 NUMBER=2 SORT ROTATE=VARIMAX

The output is:

Initial Communality Estimates
         1        2        3        4        5        6        7        8
    0.8162   0.8493   0.8006   0.7884   0.7488   0.6041   0.5622   0.4778

Iterative Maximum Likelihood Factor Analysis: Convergence = 0.0010.

Iterations History
Iteration   Maximum Change in    Negative log of
Number      SQRT(uniqueness)     Likelihood
    1            0.7226              0.3841
    2            0.2438              0.2733
    3            0.0512              0.2537
    4            0.0104              0.2532
    5            0.0005              0.2532

Communality   Specific
Estimates     Variances
   0.8302       0.1698
   0.8929       0.1071
   0.8006       0.1994
   0.8338       0.1662
   0.9109       0.0891
   0.6363       0.3637
   0.4633       0.5367
   0.5837       0.4163

Canonical Correlations
         1        2
    0.9823   0.9489

Factor Pattern
                 1        2
HEIGHT      0.8797   0.2375
ARM_SPAN    0.8735   0.3604
LOWERLEG    0.8551   0.2633
FOREARM     0.8458   0.3442
WEIGHT      0.7048  -0.6436
BITRO       0.5887  -0.5383
WIDTH       0.5743  -0.3653
GIRTH       0.5265  -0.5536

Variance Explained by Factors
                 1        2
            4.4337   1.5179

Percent of Total Variance Explained
                 1        2
           55.4218  18.9742

Rotated Pattern Matrix (VARIMAX, Gamma = 1.000000)
                 1        2
ARM_SPAN    0.9262   0.1873
FOREARM     0.8942   0.1853
HEIGHT      0.8628   0.2928
LOWERLEG    0.8569   0.2576
WEIGHT      0.2268   0.9271
BITRO       0.1891   0.7750
GIRTH       0.1289   0.7530
WIDTH       0.2734   0.6233

"Variance" Explained by Rotated Factors
                 1        2
            3.3146   2.6370

Percent of Total Variance Explained
                 1        2
           41.4331  32.9628

Percent of Common Variance Explained
                 1        2
           55.6927  44.3073

[Figure: Factor Loadings Plot (Factor(2) versus Factor(1))]
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The first panel of output contains the communality estimates. The communality of a 
variable is its theoretical squared multiple correlation with the factors extracted. For 
MLA (and IPA), the assumption for the initial communalities is the observed squared 
multiple correlation with all the other variables. 

The canonical correlations are the largest multiple correlations for successive 
orthogonal linear combinations of factors with successive orthogonal linear 
combinations of variables. These values are comfortably high. If, for other data, some 
of the factors have values that are much lower, you might want to request fewer factors. 

The loadings and amount of variance explained are similar to those found in the 
principal components example. In addition, maximum likelihood reports the 
percentage of common variance explained. Common variance is the sum of the 
communalities. If A is the unrotated MLA factor pattern matrix, common variance is 
the trace of A’ A. 
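
As a rough arithmetic check of these definitions, the sketch below recomputes
communalities, specific variances, and common variance from the unrotated two-factor
MLA factor pattern reported for the YOUTH data in this chapter. It is plain Python
written for illustration; the variable names are assumptions, and small differences
from the printed panels come from the four-decimal rounding of the loadings.

```python
# Unrotated MLA factor pattern for the YOUTH data (values as printed).
PATTERN = {
    "HEIGHT":   (0.8797,  0.2375),
    "ARM_SPAN": (0.8735,  0.3604),
    "LOWERLEG": (0.8551,  0.2633),
    "FOREARM":  (0.8458,  0.3442),
    "WEIGHT":   (0.7048, -0.6436),
    "BITRO":    (0.5887, -0.5383),
    "WIDTH":    (0.5743, -0.3653),
    "GIRTH":    (0.5265, -0.5536),
}

# Communality: sum of squared loadings across factors (a row of A).
communality = {name: sum(a * a for a in row) for name, row in PATTERN.items()}
# Specific variance: what the factors leave unexplained (correlation metric).
specific = {name: 1.0 - h for name, h in communality.items()}

# Variance explained by each factor: sum of squared loadings down a column.
column_ss = [sum(row[k] ** 2 for row in PATTERN.values()) for k in range(2)]

# Common variance: sum of the communalities, which equals trace(A'A).
common_variance = sum(communality.values())

# "Variance" explained by the rotated factors, as printed in the output.
ROTATED_SS = (3.3146, 2.6370)
percent_common = [100 * v / common_variance for v in ROTATED_SS]

for name in PATTERN:
    print(f"{name:<9}{communality[name]:8.4f}{specific[name]:8.4f}")
print("variance explained by factors:", [round(v, 4) for v in column_ss])
print("common variance:", round(common_variance, 4))
print("percent of common variance:", [round(v, 2) for v in percent_common])
```

The communalities computed this way line up with the Communality Estimates panel
when the variables are taken in the sorted order of the Factor Pattern rows.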


Number of Factors 


In this example, we specified two factors to extract. If you were to omit this
specification and rerun the example, SYSTAT adds this report to the output:

The Maximum Number of Factors for your Data is 4.

SYSTAT will also report this message if you request more than four factors for these
data. This result is due to a theorem by Ledermann and indicates that the degrees of
freedom allow estimates of loadings and communalities for only four factors.


If we set the print length to long, SYSTAT reports:

Chi-square Test that the Number of Factors is 4

Chi-square : 4.3187
df         : 2.0000
p-value    : 0.1154

The results of this chi-square test indicate that you do not reject the hypothesis that
there are four factors (p-value > 0.05). Technically, the hypothesis is that “no more than
four factors are required.” This, of course, does not negate 2 as the right number. For
the YOUTH data, here are rotated loadings for four factors:


Rotated Pattern Matrix (VARIMAX, Gamma = 1.000000) 


                 1        2        3        4
ARM_SPAN    0.9372   0.1984  -0.2831   0.0465
LOWERLEG    0.8860   0.2142   0.1878   0.1356
HEIGHT      0.8776   0.2819   0.1134  -0.0077
FOREARM     0.8732   0.1957  -0.0851  -0.0065
WEIGHT      0.2414   0.8830   0.1077   0.1080
BITRO       0.1823   0.8233   0.0163  -0.0784
GIRTH       0.1133   0.7315  -0.0048   0.5219
WIDTH       0.2597   0.6459  -0.1400   0.0819


The loadings for the last two factors do not make sense. Possibly, the fourth factor has
one variable, GIRTH, but it still has a healthier loading on factor 2. This test is based
on an assumption of multivariate normality (as is MLA itself). If not true, then the test
is invalid.
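
The limit of four factors and the 2 degrees of freedom in the chi-square report can
be reproduced with the usual degrees-of-freedom count for an orthogonal factor model,
df = ((p - m)^2 - (p + m)) / 2 for p variables and m factors. The text names only the
theorem, so treating this formula (and "df not negative") as the rule SYSTAT applies
is an assumption of this sketch. The p-value uses the fact that the upper tail of a
chi-square distribution with exactly 2 degrees of freedom is exp(-x / 2).

```python
import math


def factor_model_df(p, m):
    """Degrees of freedom left after fitting m factors to p variables."""
    # (p - m)^2 and (p + m) always share parity, so the division is exact.
    return ((p - m) ** 2 - (p + m)) // 2


def max_factors(p):
    """Largest number of factors whose degrees of freedom are not negative."""
    return max(m for m in range(1, p + 1) if factor_model_df(p, m) >= 0)


def chi_square_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variate with 2 df (exact)."""
    return math.exp(-x / 2.0)


p = 8  # variables in the YOUTH data
for m in range(1, 6):
    print(f"factors={m}  df={factor_model_df(p, m)}")
print("maximum number of factors:", max_factors(p))
print("p-value for chi-square 4.3187 on 2 df:",
      round(chi_square_upper_tail_2df(4.3187), 4))
```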


Example 3 
Iterated Principal Axis 


This example continues with the YOUTH data described in the principal components
example, this time using the IPA (iterated principal axis) method to extract factors.


The input is:


FACTOR
USE YOUTH
MODEL HEIGHT..WIDTH
ESTIMATE / METHOD=IPA SORT ROTATE=VARIMAX


The output is: 
Initial Communality Estimates
         1        2        3        4        5        6        7        8
    0.8162   0.8493   0.8006   0.7884   0.7488   0.6041   0.5622   0.4778

Iterative Maximum Likelihood Factor Analysis: Convergence = 0.0010.

Iterations History
Iteration   Maximum Change in    Negative log of
Number      SQRT(uniqueness)     Likelihood
    1            0.7226              0.3841
    2            0.2438              0.2733
    3            0.0512              0.2537
    4            0.0104              0.2532
    5            0.0005              0.2532

Communality   Specific
Estimates     Variances
   0.8302       0.1698
   0.8929       0.1071
   0.8006       0.1994
   0.8338       0.1662
   0.9109       0.0891
   0.6363       0.3637
   0.4633       0.5367
   0.5837       0.4163

Canonical Correlations
         1        2
    0.9823   0.9489

Factor Pattern
                 1        2
HEIGHT      0.8797   0.2375
ARM_SPAN    0.8735   0.3604
LOWERLEG    0.8551   0.2633
FOREARM     0.8458   0.3442
WEIGHT      0.7048  -0.6436
BITRO       0.5887  -0.5383
WIDTH       0.5743  -0.3653
GIRTH       0.5265  -0.5536

Variance Explained by Factors
                 1        2
            4.4337   1.5179

Percent of Total Variance Explained
                 1        2
           55.4218  18.9742

Rotated Pattern Matrix (VARIMAX, Gamma = 1.000000)
                 1        2
ARM_SPAN    0.9262   0.1873
FOREARM     0.8942   0.1853
HEIGHT      0.8628   0.2928
LOWERLEG    0.8569   0.2576
WEIGHT      0.2268   0.9271
BITRO       0.1891   0.7750
GIRTH       0.1289   0.7530
WIDTH       0.2734   0.6233

"Variance" Explained by Rotated Factors
                 1        2
            3.3146   2.6370

Percent of Total Variance Explained
                 1        2
           41.4331  32.9628

Percent of Common Variance Explained
                 1        2
           55.6927  44.3073

[Figure: Factor Loadings Plot (Factor(2) versus Factor(1))]


communality is less than that specified with Convergence. Replacing the diagonal of
the correlation (or covariance) matrix with these final communality estimates and
computing the eigenvalues yields the latent roots in the next panel.


Example 4 


Rotation 


components example with those from an oblique rotation. 
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The input is: 


FACTOR
USE YOUTH
PLENGTH LONG
MODEL HEIGHT..WIDTH
ESTIMATE / METHOD=PCA N=305 SORT

MODEL HEIGHT..WIDTH
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX

MODEL HEIGHT..WIDTH
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=OBLIMIN


We focus on the output directly related to the rotations. 


The output is: 
Component Loadings
                 1        2
HEIGHT      0.8594   0.3723
ARM_SPAN    0.8416   0.4410
LOWERLEG    0.8396   0.3953
FOREARM     0.8131   0.4586
WEIGHT      0.7580  -0.5247
BITRO       0.6742  -0.5333
WIDTH       0.6706  -0.4185
GIRTH       0.6172  -0.5801

Variance Explained by Components
                 1        2
            4.6729   1.7710

Percent of Total Variance Explained
                 1        2
           58.4110  22.1373

Rotated Loading Matrix (VARIMAX, Gamma = 1.000000)
                 1        2
ARM_SPAN    0.9298   0.1955
FOREARM     0.9191   0.1638
HEIGHT      0.8998   0.2599
LOWERLEG    0.8992   0.2295
WEIGHT      0.2507   0.8871
BITRO       0.1806   0.8404
GIRTH       0.1068   0.8403
WIDTH       0.2509   0.7496

"Variance" Explained by Rotated Components
                 1        2
            3.4973   2.9465

Percent of Total Variance Explained
                 1        2
           43.7165  36.8318

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.000000)

            0.0876   0.7487

"Variance" Explained by Rotated Components
                 1        2
            3.5273   2.9166

Percent of Total Variance Explained
                 1        2
           44.0913  36.4569

Direct and Indirect Contributions of Factors to Variance
                 1        2
    1       3.5087
    2       0.0186   2.8979

Rotated Structure Matrix
                 1        2
FOREARM     0.9325   0.3629
LOWERLEG    0.9277   0.4225
HEIGHT      0.9350   0.4523
WEIGHT      0.4407   0.9206
GIRTH       0.2900   0.8431
BITRO       0.3620   0.8596
WIDTH       0.4104   0.7865

[Figures: Factor Loadings Plots for No Rotation, Varimax, and Oblimin (Factor(2)
versus Factor(1))]


The values in Direct and Indirect Contributions of Factors to Variance are useful for
determining if a part of a factor’s contribution to “Variance” Explained is due to its
correlation with another factor. Notice that

3.5087 + 0.0186 = 3.5273

is the “Variance” Explained for factor 1, and

2.8979 + 0.0186 = 2.9165

is the “Variance” Explained for factor 2.

Think of the values in the Rotated Structure Matrix as correlations of the variable
with the factors. Here we see that the first four variables are highly correlated with the
first factor. The remaining variables are highly correlated with the second factor.

The factor loading plots illustrate the effects of the rotation methods. While the
unrotated factor loadings form two distinct clusters, they both have strong positive
loadings for factor 1. The “lanky” variables have moderate positive loadings on factor
2 while the “stocky” variables have negative loadings on factor 2. With the varimax
rotation, the “lanky” variables load highly on factor 1 with small loadings on factor 2;
the “stocky” variables load highly on factor 2. The oblimin rotation does a much better
job of centering each cluster at 0 on its minor factor.
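
To see what an orthogonal rotation does to the unrotated component loadings, the
following Python sketch rotates the two-component PCA loadings for the YOUTH data and
picks the angle that maximizes the raw varimax criterion by brute-force search. This
is an illustration only: it is not SYSTAT's algorithm (the Computation section
describes a variant of Kaiser's iterative method), it skips any row normalization
SYSTAT may apply, and the function names are assumptions. The result should therefore
be close to, but not identical with, the printed Rotated Loading Matrix.

```python
import math

# Unrotated PCA component loadings for the YOUTH data (values as printed).
LOADINGS = {
    "HEIGHT":   (0.8594,  0.3723),
    "ARM_SPAN": (0.8416,  0.4410),
    "LOWERLEG": (0.8396,  0.3953),
    "FOREARM":  (0.8131,  0.4586),
    "WEIGHT":   (0.7580, -0.5247),
    "BITRO":    (0.6742, -0.5333),
    "WIDTH":    (0.6706, -0.4185),
    "GIRTH":    (0.6172, -0.5801),
}


def rotate(loadings, degrees):
    """Express every variable on axes turned by the given angle."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return {name: (x * c + y * s, -x * s + y * c)
            for name, (x, y) in loadings.items()}


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings (raw varimax)."""
    p = len(loadings)
    total = 0.0
    for k in range(2):
        squares = [row[k] ** 2 for row in loadings.values()]
        total += sum(v * v for v in squares) / p - (sum(squares) / p) ** 2
    return total


# The criterion repeats every 90 degrees, so a search over [0, 90) suffices.
best_angle = max((a / 100 for a in range(9000)),
                 key=lambda a: varimax_criterion(rotate(LOADINGS, a)))
rotated = rotate(LOADINGS, best_angle)

# Reflect a factor if its loadings are mostly negative; this changes neither
# the criterion nor the communalities.
signs = [1 if sum(row[k] for row in rotated.values()) >= 0 else -1
         for k in range(2)]
rotated = {name: (signs[0] * row[0], signs[1] * row[1])
           for name, row in rotated.items()}

print(f"rotation angle: {best_angle:.2f} degrees")
for name, row in rotated.items():
    print(f"{name:<9}{row[0]:9.4f}{row[1]:9.4f}")
```

An orthogonal rotation leaves each variable's communality (the row sum of squared
loadings) unchanged; only the split of that variance between the two factors moves.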


Example 5 
Factor Analysis Using a Covariance Matrix 


Jackson (1991) describes a project in which the maximum thrust of ballistic missiles
was measured. For a specific measure called “total impulse,” it is necessary to
calculate the area under a curve. Originally, a planimeter was used to obtain the area,
and later an electronic device performed the integration directly but unreliably in its


In this example, we illustrate features associated with covariance matrix input
(asymptotic 95% confidence limits for the eigenvalues, estimates of the population


and latent vectors (eigenvectors or characteristic


The input is: 


FACTOR
USE MISSLES
MODEL INTEGRA1 PLANMTR1 INTEGRA2 PLANMTR2
PLENGTH LONG
ESTIMATE / METHOD=PCA COVA N=40
The output is: 


Latent Roots (Eigenvalues)
                   1          2          3          4
            335.3355    48.0344    29.3305    16.4096

Empirical Upper Bound for the First Eigenvalue : 398.0000

Asymptotic 95% Confidence Limits for the Eigenvalues, N = 40
Upper Limits
            596.9599    85.5102    52.2138    29.2122
Lower Limits
            233.1534    33.3975    20.3930    11.4093

Unbiased Estimates of Population Eigenvalues
                   1          2          3          4
            332.6990    46.9298    31.0859    18.3953

Unbiased Estimates of Standard Errors of Eigenvalues
                   1          2          3          4
             74.9460    10.1768     5.7355     3.2528

Chi-square Test that All Eigenvalues are Equal
N          : 40.0000
Chi-square : 110.6871
df         : 9.0000
p-value    : 0.0000

Latent Vectors (Eigenvectors)
                   1          2          3          4
INTEGRA1      0.4681     0.6215     0.5716     0.2606
PLANMTR1      0.6079     0.1788    -0.7595     0.1473
INTEGRA2      0.4590    -0.1387     0.1677    -0.8614
PLANMTR2      0.4479    -0.7500     0.2615     0.4104

Standard Error for Each Eigenvector Element
                   1          2          3          4
INTEGRA1      0.0532     0.1879     0.2106     0.1773
PLANMTR1      0.0412     0.2456     0.0758     0.2066
INTEGRA2      0.0342     0.1359     0.2366     0.0519
PLANMTR2      0.0561     0.1058     0.2633     0.1276

Component Loadings
                   1          2          3          4
INTEGRA1      8.5727     4.3072     3.0954     1.0559
PLANMTR1     11.1325     1.2389    -4.1131     0.5965
INTEGRA2      8.4051    -0.9616     0.9084    -3.4893
PLANMTR2      8.2017    -5.1983     1.4165     1.6625

Variance Explained by Components
                   1          2          3          4
            335.3355    48.0344    29.3305    16.4096

Percent of Total Variance Explained
                   1          2          3          4
             78.1467    11.1940     6.8352     3.8241

Differences: Original Minus Fitted Correlations or Covariances
            INTEGRA1   PLANMTR1   INTEGRA2   PLANMTR2
INTEGRA1      0.0000
PLANMTR1      0.0000     0.0000
INTEGRA2      0.0000     0.0000     0.0000
PLANMTR2      0.0000     0.0000     0.0000     0.0000

[Figures: Scree Plot (eigenvalues versus Number of Factors); Factor Loadings Plot
(scatterplot matrix of FACTOR(1) through FACTOR(4))]


SYSTAT performs a test to determine if all eigenvalues are equal. The null hypothesis
is that all eigenvalues are equal against an alternative hypothesis that at least one root
is different. The results here indicate that you reject the null hypothesis (p < 0.00005).
At least one of the eigenvalues differs from the others.

The size and sign of the loadings reflect how the factors and variables are related.
The first factor has fairly similar loadings for all four variables. You can interpret this

loadings for factors 2 through 4 convey little information (notice that values in the
stripe displays along the diagonal concentrate around 0, while those for factor 1 fall to
the right).
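
The printed confidence limits are consistent with the common large-sample
approximation for eigenvalues of a covariance matrix, eigenvalue / (1 ± z * sqrt(2/N)).
That formula is inferred here from the numbers in the output rather than stated in the
text, so the Python sketch below should be read as a plausibility check, not as a
description of SYSTAT's code. It also recomputes the percent of total variance, which
for covariance input is each eigenvalue divided by the sum of the eigenvalues.

```python
import math
from statistics import NormalDist

EIGENVALUES = [335.3355, 48.0344, 29.3305, 16.4096]  # latent roots as printed
N = 40                                               # sample size from the input


def asymptotic_limits(eigenvalue, n, level=0.95):
    """Return (lower, upper) large-sample limits for one eigenvalue."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    half_width = z * math.sqrt(2.0 / n)
    return eigenvalue / (1.0 + half_width), eigenvalue / (1.0 - half_width)


limits = [asymptotic_limits(ev, N) for ev in EIGENVALUES]
total_variance = sum(EIGENVALUES)  # trace of the covariance matrix
percent = [100.0 * ev / total_variance for ev in EIGENVALUES]

for ev, (lower, upper), pct in zip(EIGENVALUES, limits, percent):
    print(f"{ev:10.4f}  lower={lower:9.4f}  upper={upper:9.4f}  {pct:7.4f}%")
```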


Example 6 
Factor Analysis Using a Rectangular File 


Begin this analysis from the OURWORLD cases-by-variables data file. Each case 
contains information for one of 57 countries. We will study the interrelations among a 
subset of 13 variables including economic measures (gross domestic product per capita 
and U.S. dollars spent per person on education, health, and the military), birth and 
death rates, population estimates for 1983, 1986, and 1990 plus predictions for 2020, 
and the percentages of the population who can read and who live in cities. 

We request principal components extraction with an oblique rotation. As a first step, 
SYSTAT computes the correlation matrix. Correlations measure linear relations. 
However, plots of the economic measures and population values as recorded indicate 
a lack of linearity, so you use base 10 logarithms to transform six variables, and you 
use square roots to transform two others. 
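
The reason for transforming before correlating can be shown with a small Python
sketch. The data here are hypothetical (they are not taken from OURWORLD): y grows
exponentially with x, so the Pearson correlation understates the relation until y is
put on a base 10 logarithmic scale.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical, perfectly exponential relation between x and y.
x = [float(i) for i in range(1, 11)]
y = [10 ** (0.3 * xi) for xi in x]

r_raw = pearson(x, y)                            # curved relation
r_log = pearson(x, [math.log10(v) for v in y])   # straight line after L10

print(f"correlation with raw y:   {r_raw:.4f}")
print(f"correlation with log10 y: {r_log:.4f}")
```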


The input is: 


FACTOR
USE OURWORLD
LET (GDP_CAP, GNP_86, POP_1983, POP_1986, POP_1990, POP_2020),
    = L10(@)
LET (MIL,EDUC) = SQR(@)
MODEL URBAN BIRTH_RT DEATH_RT GDP_CAP GNP_86 MIL,
      EDUC B_TO_D LITERACY POP_1983 POP_1986,
      POP_1990 POP_2020
PLENGTH MEDIUM
SAVE pcascore / SCORES
ESTIMATE / METHOD=PCA SORT ROTATE=OBLIMIN


The output is: 


Matrix to be Factored
               URBAN  BIRTH_RT  DEATH_RT   GDP_CAP    GNP_86
URBAN         1.0000
BIRTH_RT     -0.8002    1.0000
DEATH_RT     -0.5126    0.5110    1.0000
GDP_CAP       0.7636   -0.9189   -0.4012    1.0000
GNP_86        0.7747   -0.8786   -0.4518    0.9736    1.0000
MIL           0.6453   -0.7547   -0.1482    0.8657    0.6324
EDUC          0.6238   -0.7528   -0.2151    0.8996    0.920
B_TO_D       -0.3074    0.5106   -0.4340   -0.5293   -0.4411
LITERACY      0.7997   -0.9302   -0.6601    0.8337    0.8404
POP_1983      0.2133   -0.0836    0.0152    0.0583    0.0090
POP_1986      0.1898   -0.0523    0.0291    0.0248   -0.0215
POP_1990      0.1700   -0.0252    0.0284   -0.0015   -0.0447
POP_2020      0.0054    0.1880    0.0743   -0.2116   -0.2484

                                  -0.1070   -0.0534
POP_2020     -0.0339   -0.2555    0.0617   -0.2360

            POP_1986  POP_1990  POP_2020
POP_1986
POP_1990
POP_2020      0.9605    0.9673    1.0000

Latent Roots (Eigenvalues)
         1        2        3        4        5
    6.3950   4.0165   1.6557   0.4327   0.2390

        11       12       13
    0.0054   0.0012   0.0002

Empirical Upper Bound for the First Eigenvalue : 7.4817 


Chi-square Test that All Eigenvalues are Equal 


N : 49.0000 
Chi-square : 1542.2903 
df : 78.0000 
P-value : 0.0000 


Chi-square Test that the Last 10 Eigenvalues are Equal 


Chi-square : 636.4350 
df : 59.8885 
P-value : 0.0000 


Component Loadings
                 1        2        3
GDP_CAP     0.9769  -0.0366  -0.0606
GNP_86      0.9703  -0.0846   0.0040
BIRTH_RT   -0.9512   0.0136  -0.0774
LITERACY    0.8972   0.1008   0.3004
EDUC        0.8927  -0.0857  -0.2296
MIL         0.8770   0.1501  -0.2906
URBAN       0.8393   0.1425   0.2300
B_TO_D     -0.5166  -0.1225   0.7762
POP_1990    0.0382   0.9972   0.0394
POP_1986    0.0636   0.9966   0.0253
POP_1983    0.0945   0.9940   0.0248
POP_2020   -0.1796   0.9748   0.1002
DEATH_RT   -0.4533   0.0820  -0.8662

Variance Explained by Components
                 1        2        3
            6.3950   4.0165   1.6557

Percent of Total Variance Explained
                 1        2        3
           49.1924  30.8964  12.7361

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0000)
                 1        2        3
BIRTH_RT
EDUC
LITERACY
MIL
URBAN
B_TO_D
POP_1990
POP_1986    0.0491
POP_1983    0.0801   0.9932
POP_2020   -0.1945   0.9805
DEATH_RT   -0.4459  -0.0011

"Variance" Explained by Rotated Components
                 1        2        3
            6.3946   4.0057   1.6669

Percent of Total Variance Explained
                 1        2        3
           49.1895  30.8129  12.8225

Correlations Among Oblique Factors or Components
                 1        2        3
    1       1.0000
    2       0.0127   1.0000
    3      -0.0020   0.0452   1.0000

[Figures: Scree Plot (eigenvalues versus Number of Factors); Factor Loadings Plot]


By default, SYSTAT extracts three factors because three eigenvalues are greater than
1.0. On factor 1, seven or eight variables have high loadings. The eighth (B_TO_D, a
ratio of birth-to-death rate) has a higher loading on factor 3. With the exception of
BIRTH_RT, the other variables are economic measures, so let us identify this as the
"economic" factor. Clearly, the second factor can be named "population," and the third,


The economic and population factors account for 80% (49.19 + 30.81) of the total


be useful for characterizing
differences among the countries. The third factor accounts for 13% of the total


ctors. Notice, too, that only 7%
of the total variance is not accounted for by these three factors.


Revisiting the Correlation Matrix 


loadings for the factor on which they load the highest, 


The input is: 


CORR
USE OURWORLD
LET (GDP_CAP, GNP_86, POP_1983, POP_1986, POP_1990,
     POP_2020) = L10(@)
LET (MIL,EDUC) = SQR(@)
PEARSON GDP_CAP GNP_86 BIRTH_RT EDUC LITERACY MIL URBAN,
        POP_1990 POP_1986 POP_1983 POP_2020 B_TO_D DEATH_RT


The output is: 


Pearson Correlation Matrix 


GDP_CAP 
GNP 86 


BIRTH RT | 


EDUC 
LITERACY 
MIL 
URBAN 
POP 1990 
POP 1986 
POP 1983 
POP 2020 
B TO D 
DEATH RT 


GDP CAP 
GNP 86 
BIRTH RT 
EDUC . 
LITERACY 
MIL 
URBAN 
POP 1990 
РОР 1986 


* 


П 
i 
' 
i 
' 
i 
' 
П 
i 
' 
i 
i 


E S — 


РОР 1983 | 


РОР 2020 
B TO D 
DEATH RT 


GDP CAP 


.0000 
.1700 
.1898 
.2133 
.0054 
-0.3074 
-0.5126 


сосоосе- 


.0000 
.9992 
.9966 
.9673 
.1070 
.0284 
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1.0000 
0.9984 
0.9605 
-0.1358 
0.0291 


Pearson Correlation Matrix (contd...) 


GDP_CAP 
GNP 86 


BIRTH_RT 


EDUC 
LITERACY 
MIL 
URBAN 


+ 
' 
' 
n 


РОР 1990 | 
РОР 1986 | 


РОР 1983 


РОР "2020 ! 


B_TO D 
DEATH_RT 


DEATH_RT 


1.0000 


1.0000 
0.9531 
-0.1526 
0.0152 


LITERACY 


-0.6601 
POP_2020 


1.0000 
0.0617 
0.0743 




P 1986 РОР 1983 РОР 2020 B TO D DEATH RT 


-0.6184 
-0.1482 


B TO D 


1.0000 
-0.4340 
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the first letter of their group membership). Finally, use SPLOMs to display the scores, 
adding 75% confidence ellipses for each subgroup in the plots and normal curves for 


The input is: 


MERGE PCASCORE.SYD (FACTOR(1) FACTOR(2) FACTOR(3)),
      OURWORLD.SYD (GROUP$ COUNTRY$)
PLOT FACTOR(2)*FACTOR(1) / XLABEL='Economic',
     YLABEL='Population' SYMBOL=4,2,3,
     SIZE=1.250 LABEL=COUNTRY$ CSIZE=1.250
PLOT FACTOR(3)*FACTOR(1) / XLABEL='Economic',
     YLABEL='Death Rate' COLOR=2,1,10,
     SYMBOL=GROUP$ SIZE=1.250,1.250,1.250
SPLOM FACTOR(1) FACTOR(2) FACTOR(3) / GROUP=GROUP$ OVERLAY,
      DENSITY=NORMAL ELL=0.750,
      COLOR=2,1,10 SYMBOL=4,2,3,
      DASH=1,1,4
SPLOM FACTOR(1) FACTOR(2) FACTOR(3) / GROUP=GROUP$ OVERLAY,
      DENSITY=KERNEL COLOR=2,1,10,
      SYMBOL=4,2,3 DASH=1,1,4


The output is:


[Figures: scatterplot of the Population factor against the Economic factor with
country labels; scatterplot of the Death Rate factor against the Economic factor
with group symbols; two scatterplot matrices (SPLOMs) of FACTOR(1), FACTOR(2), and
FACTOR(3) by GROUP$, one with normal curves and 75% confidence ellipses and one with
kernel densities]


High loadings on the “economic” factor show countries that are strong economically 
(Germany, Canada, Netherlands, Sweden, Switzerland, Denmark, and Norway) 
relative to those with low loadings (Bangladesh, Ethiopia, Mali, and Gambia). Not 
surprisingly, the population factor identifies Barbados as the smallest and Bangladesh, 
Pakistan, and Brazil as largest. The questionable third factor (death rate) does help to 
separate the New World countries from the others. 

In each SPLOM, the dashed lines marking curves, ellipses, and kernel contours
identify New World countries. The kernel contours in the plot of factor 3 against factor
1 identify a pocket of Islamic countries within the New World group.
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Computation 


Algorithms 


Provisional methods are used for computing covariance or correlation matrices (see
Correlations for references). Components are computed by using a Householder
tridiagonalization and implicit QL iterations. Rotations are computed with a variant of
Kaiser’s iterative algorithm, described in Mulaik (1972).


Missing Data 


Ordinarily, Factor Analysis and other multivariate procedures delete all cases having 
missing values on any variable selected for analysis. This is listwise deletion. For data 
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13 
Fitting Distributions 


Mangalmurti Badgujar and S. Anoopama 


The Fitting Distributions feature can be used to try to fit an appropriate distribution
to your data. SYSTAT provides a wide variety of univariate discrete and continuous
distributions that include, besides the standard distributions, distributions such as
Gumbel, Gompertz, Weibull, inverse Gaussian, Zipf, Rayleigh, etc. Two procedures
are available for testing the goodness of fit: Chi-square goodness-of-fit and the
Kolmogorov-Smirnov test. For normal, lognormal, and logit normal distributions, the
Shapiro-Wilk test is also available.

You can visually compare the observed distribution with the theoretical
distribution using suitable plots. For discrete distributions, use a bar plot of theoretical
probabilities using estimated and/or specified parameters and relative frequencies.
For continuous distributions, use the density function computed using estimated or
specified parameters, and overlaid on a histogram of the data.

In case the parameters are estimated, the p-value associated with the Kolmogorov-
Smirnov statistic gives an overestimate. In such cases, interpret the p-value with
caution.


Statistical Background 


Probability distributions are often successfully used to model data obtained under
conditions of uncertainty. For example, in radioactive particle emission trials, the
Poisson distribution may be an appropriate model to describe the radioactive decay
data. Likewise, there are numerous instances wherein the normal, exponential, and
Weibull distributions are used to model the data. The distribution model, with suitably
estimated parameter values, may be considered to be a smoothed version of the data
and may form a useful description of the data.

There is another area where distributions play a key role. Most classical statistical
methods are based on certain distributional assumptions. For instance, in regression
analysis and ANOVA, various tests of hypotheses are performed assuming that the
data follow a normal distribution. In survival analysis, the data are assumed to be from
an exponential or a gamma distribution. Naturally, the validity of the results of the
analyses depends on the validity of such distributional assumptions. It is therefore
essential to check if your data follow the assumed distribution.

The Fitting Distributions feature enables you to assess whether the observed data
can be modeled by a distribution from a parametric family of distributions with
appropriately chosen parameter values. For this purpose, SYSTAT provides thirty
univariate discrete and continuous distributions (a list is given below). When your data
are from a continuous distribution, SYSTAT allows you to fit more than one
distribution at a time, and, based on the goodness-of-fit tests results, you can choose
an appropriate distribution to model your data.


Goodness-of-Fit Tests 


Two test procedures for assessing the goodness-of-fit are provided: the chi-square
goodness-of-fit test and the one-sample Kolmogorov-Smirnov test. For normal,
lognormal, and logit normal distributions, the Shapiro-Wilk test is also available.

The chi-square goodness-of-fit test reports the chi-square statistic along with the
p-value. This test can be applied to discrete as well as continuous distributions;
however, you need a sufficiently large sample to generate a reasonable frequency
distribution because this is a large-sample test.

The Kolmogorov-Smirnov test compares the observed (empirical) distribution
function to the hypothesized distribution function. The test statistic is based on the
maximum difference between these two distribution functions. In both the
procedures, the parameters of the selected distribution, if not specified, are estimated
from the sample using the methods shown below:


Distribution(s)                                          Method of estimation

Beta, Chi-square, Erlang, Gamma, Gumbel, Logistic,       Method of moments
Loglogistic, Smallest extreme value

Binomial, Poisson, Geometric, Hypergeometric,            Maximum likelihood method
Negative binomial, Discrete uniform, Zipf, Normal,
Lognormal, Logit normal, Exponential,
Double exponential, Gompertz, Inverse Gaussian, Pareto,
Rayleigh, Weibull, Uniform, Benford’s law,
Logarithmic series

Cauchy                                                   Method of quantiles or order
                                                         statistics

Triangular                                               Modified maximum likelihood and
                                                         moments
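
The Kolmogorov-Smirnov statistic described above can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not SYSTAT's routine: the
sample is hypothetical, the normal parameters are the maximum likelihood estimates
(the mean and the divide-by-n standard deviation), and no p-value is computed.

```python
from statistics import NormalDist, mean, pstdev


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical sample; parameters are estimated because none are specified.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.1, 5.9, 4.7, 5.2]
fitted = NormalDist(mean(sample), pstdev(sample))

d = ks_statistic(sample, fitted.cdf)
print(f"location={fitted.mean:.4f} scale={fitted.stdev:.4f} D={d:.4f}")
```

When the parameters are estimated from the same sample, as here, the usual p-value
for D is too large, which is why the text advises interpreting it with caution.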


Shapiro and Wilk's W Test for Normality 


The Shapiro-Wilk (1965) test is used to check if a random sample
Y = (y_1, y_2, \ldots, y_n) comes from a normal distribution. The test statistic W is
given by:

    W = \frac{\left( \sum_{i=1}^{n} a_i y_{(i)} \right)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

where y_{(i)} are the ordered observations,

    a' = (a_1, a_2, \ldots, a_n) = m' V^{-1} \left[ m' V^{-1} V^{-1} m \right]^{-1/2}

m is the vector of expected values of standard normal order statistics based on
sample size n, and V is the corresponding covariance matrix.

SYSTAT computes the W statistic and the p-value.

For further information on distributions and fitting procedures, refer to Evans et
al. (2000), Johnson et al. (1994, 1995, 2005), Karian and Dudewicz (2000), pes
and Johnson (1983).
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Fitting Distributions in SYSTAT 


Fitting Distributions: Discrete Dialog Box 


To open the Fitting Distributions: Discrete dialog box, from the menus choose: 
Analyze 
Fitting Distributions 
Discrete ... 


[Figure: Fitting Distributions: Discrete dialog box]


Selected variable(s). Select the variable(s) that contains the data for which a discrete
distribution is to be fitted.
distribution is to be fitted. 
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Selected distribution. Distribution selected for fitting to your data. You can specify the 
associated parameter(s) if known; else SYSTAT computes their estimates. The default 
distribution is the Poisson distribution with the parameter estimated from the data. The 
following discrete distributions are available: 


■ Benford's law. Fits the Benford's law(B) distribution to the data.
■ Binomial. Fits the binomial(n, p) distribution to the data. You have the option to
  specify the value of p. However, you must specify the value of n.
■ Discrete uniform. Fits the discrete uniform(N) distribution to the data.
■ Geometric. Fits the geometric(p) distribution to the data.
■ Hypergeometric. Fits the hypergeometric(N, m, n) distribution to the data. You can
  specify a value of N, if you wish. However, you must enter the values of
  m and n.
■ Logarithmic series. Fits the logarithmic series(theta) distribution to the data.
■ Negative binomial. Fits the negative binomial(k, p) distribution to the data.
■ Poisson. Fits the Poisson(lambda) distribution to the data.
■ Zipf. Fits the Zipf(shp) distribution to the data.

Save frequencies. You can save class intervals, observed, and expected frequencies to
a specified file.


Fitting Distributions: Continuous Dialog Box 


To open the Fitting Distributions: Continuous dialog box, from the menus choose:
Analyze
Fitting Distributions
Continuous ...


[Figure: Fitting Distributions: Continuous dialog box]


Selected variable(s). Select the variable(s) that contains the data for which you want to
fit continuous distribution(s).


computes parameter estimates. The default distribution is a normal distribution with
parameters estimated from the data. The following distributions are available:

■ Beta. Fits the beta(shp1, shp2) distribution to the data.
■ Cauchy. Fits the Cauchy(loc, sc) distribution to the data.
■ Chi-square. Fits the chi-square(df) distribution to the data.
■ Double exponential (Laplace). Fits the double exponential(loc, sc) distribution to
  the data.
■ Erlang. Fits the Erlang(shp, sc) distribution to the data.
■ Exponential. Fits the exponential(loc, sc) distribution to the data.
■ Gamma. Fits the gamma(shp, sc) distribution to the data.
■ Gompertz. Fits the Gompertz(P, c) distribution to the data.
■ Gumbel. Fits the Gumbel(loc, sc) distribution to the data.
■ Inverse Gaussian (Wald). Fits the inverse Gaussian(loc, sc) distribution to the data.
■ Logistic. Fits the logistic(loc, sc) distribution to the data.
■ Logit normal. Fits the logit normal(loc, sc) distribution to the data.
■ Loglogistic. Fits the loglogistic(logsc, shp) distribution to the data.
■ Lognormal. Fits the lognormal(loc, sc) distribution to the data.
■ Normal. Fits the normal(loc, sc) distribution to the data.
■ Pareto. Fits the Pareto(thr, shp) distribution to the data.
■ Rayleigh. Fits the Rayleigh(sc) distribution to the data.
■ Smallest extreme value. Fits the smallest extreme value(loc, sc) distribution to the
  data.
■ Triangular. Fits the triangular(a, b, c) distribution to the data.
■ Uniform. Fits the uniform(min, max) distribution to the data.
■ Weibull. Fits the Weibull(sc, shp) distribution to the data.

Save frequencies. You can save class intervals, observed, and expected frequencies to
a specified file.


Using Commands 


For fitting discrete distributions:


First, specify the data file with USE filename, and continue with:


FITDIST
SAVE filename
DISCRETE varlist /DISTRIBUTION = name1, name2, ...
/DISTRIBUTION = name (parameter list)


Distribution          Name   Parameter(s)
Benford's law         BL     B
Binomial              N      n, p or n
Discrete uniform      DU     N
Geometric             GE     p
Hypergeometric        H      N, m, n or m, n
Logarithmic series    LS     theta
Negative binomial     NB     k, p
Poisson               P      lambda
Zipf                  ZI     shp


For fitting continuous distributions:


First, specify the data file with USE filename, and continue with:


FITDIST
SAVE filename
CONTINUOUS varlist /DISTRIBUTION = name1, name2, ...
/DISTRIBUTION = name (parameter list)


Distribution                    Name   Parameter(s)
Beta                            B      shp1, shp2
Cauchy                          C      loc, sc
Chi-square                      X      df
Double exponential (Laplace)    DE     loc, sc
Erlang                          ER     shp, sc
Exponential                     E      loc, sc
Gamma                           G      shp, sc
Gompertz                        GO     b, c
Gumbel                          GU     loc, sc
Inverse Gaussian (Wald)         IG     loc, sc
Logistic                        L      loc, sc
Logit normal                    EN     loc, sc
Loglogistic                     LO     logsc, shp
Lognormal                       LN     loc, sc
Normal                          Z      loc, sc
Pareto                          PA     thr, shp
Rayleigh                        R      sc
Smallest extreme value          SE     loc, sc
Triangular                      TR     a, b, c
Uniform                         U      min, max
Weibull                         W      sc, shp

Note:

min = Minimum; max = Maximum
loc = Location parameter; sc = Scale parameter
shp = Shape parameter; thr = Threshold parameter; logsc = log of scale parameter


Usage Considerations 
Types of data. The input data should be ‘raw’ (un-binned) in a column. 
Print options. The output is standard for all PLENGTH lengths. 


Quick Graphs. FITDIST produces a graph of theoretical densities and relative
frequencies overlaid when fitting continuous distributions. When fitting discrete
distributions, the theoretical probabilities and the relative frequencies are plotted
side by side.

Saving files. You can save class intervals, observed and expected frequencies to a
specified file.

BY groups. FITDIST analyses data by groups.

Case frequencies. FREQUENCY is available in FITDIST for fitting discrete probability
distributions.


Case weights. WEIGHT is not available in FITDIST. 


Examples 


Data sets used in Examples 1 to 6 are taken from Hand et al. (1996).

Example 1
Fitting Binomial Distribution


For a psychological experiment on visual perception, an investigator needs to color
155520 squares using either a black color with probability 0.29 or a white color with
probability 0.71. A computer was used to make the coloring decision. From the
155520 squares, 1000 non-overlapping samples, each containing 16 small squares, were
randomly selected, and the number of black squares was counted in each sample. The
frequency distribution of this count is given in the file PATTERN. The question is whether
the data follow a binomial distribution.


The input is: 


FITDIST
USE PATTERN
FREQUENCY FREQUENCY
DISCRETE COUNT / DIST=N (16)


The output is: 


Variable Name : COUNT
Distribution  : Binomial

Specified Parameter(s)
Number of Trials(n)    : 16.000

Estimated Parameter(s)
Probability of Success : 0.295

Estimation of Parameter(s): Maximum Likelihood Method


Test Results

Lower Limit   Upper Limit   Observed   Expected
     0             1            30       28.691
     2             2            93       78.312
     3             3           159      152.875
     4             4           184      207.837
     5             5           195      208.659
     6             6           171      160.022
     7             7            92       95.628
     8             8            45       45.003
     9             9            24       16.734
    10             .             7        6.240

Chi-square Test Statistic : 10.827
Degrees of Freedom        : 8
p-value                   : 0.212

*** WARNING *** : One or more parameters of distribution are estimated.
Significance of K-S test computed on this distribution is suspect.

Kolmogorov-Smirnov Test Statistic : 0.022
p-value                           : 0.712

[Figure: Fitted Distribution. Observed relative frequency and expected probability plotted against COUNT.]


The p-value indicates that the data can be modeled by a binomial distribution. 
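
As a cross-check, the chi-square statistic and its p-value can be recomputed from the observed and expected columns of the table above. The sketch below is not part of SYSTAT; the function names are illustrative, and the closed-form tail probability it uses holds only for an even number of degrees of freedom (here 8: ten classes, minus one, minus one estimated parameter).

```python
import math

# Observed and expected frequencies copied from the Example 1 output table.
observed = [30, 93, 159, 184, 195, 171, 92, 45, 24, 7]
expected = [28.691, 78.312, 152.875, 207.837, 208.659,
            160.022, 95.628, 45.003, 16.734, 6.240]


def chi_square_statistic(obs, exp):
    """Pearson statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def chi_square_sf_even(x, df):
    """Upper-tail chi-square probability; this closed form needs even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


statistic = chi_square_statistic(observed, expected)
# 10 classes, minus 1, minus 1 estimated parameter (p).
df = len(observed) - 1 - 1
p_value = chi_square_sf_even(statistic, df)
print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Because the expected frequencies in the listing are rounded to three decimals, the recomputed statistic can differ from the printed one in the last digit.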


Example 2 
Fitting Discrete Uniform Distribution 


The data file BIRTHS contains information on the frequency of births in each month
(labeled 1, 2, ..., 12) of a year in the University Hospital of Basel, Switzerland. The
investigator wants to know if the births are evenly spread throughout the year, that is,
if the data follow a discrete uniform distribution over the 12 months.


The input is: 


FITDIST
USE BIRTHS
FREQUENCY FREQUENCY
DISCRETE MONTH / DIST=DU (12)
The output is:


Variable Name : MONTH
Distribution  : Discrete Uniform

Specified Parameter(s)
Number of Points(N) : 12.000


Test Results

Lower Limit   Upper Limit   Observed   Expected
     1             1            66       58.333
     2             2            63       58.333
     3             3            64       58.333
     4             4            48       58.333
     5             5            64       58.333
     6             6            74       58.333
     7             7            70       58.333
     8             8            59       58.333
     9             9            54       58.333
    10            10            51       58.333
    11            11            45       58.333
    12             .            42       58.333
                               700      700.000

Chi-square Test Statistic : 19.726
Degrees of Freedom        : 11
p-value                   : 0.049

Kolmogorov-Smirnov Test Statistic : 0.059
p-value                           : 0.015


[Figure: Fitted Distribution. Observed relative frequency and expected probability plotted against MONTH.]

The p-values of both the tests indicate that the number of births is not uniformly 
distributed throughout a year. 
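
The chi-square result for the births data can be reproduced by hand, since the discrete uniform hypothesis assigns the same expected count to every month. The following is a sketch only, with illustrative function names; it evaluates the tail probability with the finite series that applies to an odd number of degrees of freedom (here 11), using nothing beyond the Python standard library.

```python
import math

# Monthly birth counts copied from the Example 2 output table (months 1 to 12).
births = [66, 63, 64, 48, 64, 74, 70, 59, 54, 51, 45, 42]


def uniform_chi_square(counts):
    """Pearson statistic against equal expected counts in every class."""
    expected = sum(counts) / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, expected


def chi_square_sf_odd(x, df):
    """Upper-tail chi-square probability via the finite series for odd df."""
    if df <= 0 or df % 2 != 1:
        raise ValueError("this finite series needs a positive odd df")
    chi = math.sqrt(x)
    normal_tails = math.erfc(chi / math.sqrt(2.0))   # P(|Z| > chi)
    normal_density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, series = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)   # chi^(2r-1) / (1*3*...*(2r-1))
        series += term
    return normal_tails + 2.0 * normal_density * series


statistic, expected = uniform_chi_square(births)
df = len(births) - 1   # no parameter is estimated from the data
p_value = chi_square_sf_odd(statistic, df)
print(f"expected = {expected:.3f}, chi-square = {statistic:.3f}, "
      f"df = {df}, p-value = {p_value:.3f}")
```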
Example 3
Fitting Exponential Distribution


Do earthquakes occur at random? The data file QUAKES contains the time in days
between successive occurrences of serious quakes worldwide. If earthquakes occur at
random, an exponential distribution with a mean occurrence time of 430 days will be
a suitable model for time between earthquakes. To see this, we fit E(0, 430) to the data.


The input is: 


FITDIST 
USE QUAKES 
CONTINUOUS TIME /DIST=E(0, 430) 


The output is: 


Variable Name : TIME
Distribution  : Exponential

Specified Parameter(s)
Location(theta) : 0.000
Scale(lambda)   : 430.000


Test Results

Lower Limit   Upper Limit   Observed   Expected
      .         198.200         21       22.897
  198.200       387.400         14       13.919
  387.400       576.600          8        8.965
  576.600       955.000         15        9.492
  955.000          .             4        6.728
                                62       62.000

Chi-square Test Statistic : 4.564
Degrees of Freedom        : 4
p-value                   : 0.335

Kolmogorov-Smirnov Test Statistic : 0.080
p-value                           :

[Figure: Fitted Distribution. Observed relative frequencies with the fitted exponential density, plotted against TIME.]
The p-value for the chi-square goodness-of-fit test indicates that the exponential
distribution is a good fit. This supports the claim that time between serious earthquakes
can be modeled by an exponential(0, 430) distribution.
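
The expected frequencies in the Example 3 table follow directly from the exponential distribution function with location 0 and scale 430. The sketch below assumes the parameterization F(t) = 1 - exp(-(t - loc) / sc), which reproduces the printed values; the variable and function names are illustrative.

```python
import math

n = 62                                   # total number of observations
location, scale = 0.0, 430.0             # E(0, 430) as specified in the input
cuts = [198.2, 387.4, 576.6, 955.0]      # class boundaries from the output table
observed = [21, 14, 8, 15, 4]


def exponential_cdf(t, loc, sc):
    """Distribution function of the exponential(loc, sc) distribution."""
    if t <= loc:
        return 0.0
    return 1.0 - math.exp(-(t - loc) / sc)


# Open-ended first and last classes run to minus and plus infinity.
edges = [-math.inf] + cuts + [math.inf]
expected = [
    n * (exponential_cdf(hi, location, scale) - exponential_cdf(lo, location, scale))
    for lo, hi in zip(edges[:-1], edges[1:])
]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

for (lo, hi), o, e in zip(zip(edges[:-1], edges[1:]), observed, expected):
    print(f"{lo:>9} {hi:>9} {o:>4} {e:8.3f}")
print(f"chi-square = {statistic:.3f}")
```

Because both parameters are specified rather than estimated, the degrees of freedom are the number of classes minus one.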
Example 4 


Fitting Gumbel Distribution 


The file MINTEMP contains the annual minimum temperature (F) of Plymouth (in
Britain) for 49 years (1916-1964). Barnett and Lewis (1967) fitted a Gumbel
distribution to the data.


The input is: 


FITDIST 
USE MINTEMP 
CONTINUOUS TEMP /DIST=GU


The output is:


Variable Name : TEMP
Distribution  : Gumbel

Estimated Parameter(s)
Location(alpha) : 23.374
Scale(theta)    : 2.959

Estimation of Parameter(s): Method of Moments


Test Results

Lower Limit   Upper Limit   Observed   Expected
      .          22.000         11        9.984
   22.000        23.500          4        8.811
   23.500        25.000         13        8.718
   25.000        26.500          2        7.098
   26.500        29.500         14        8.582
   29.500           .            5        5.806
                                49       49.000

Chi-square Test Statistic : 12.025
Degrees of Freedom        : 3
p-value                   : 0.007

*** WARNING *** : One or more parameters of distribution are estimated.
Significance of K-S test computed on this distribution is suspect.

Kolmogorov-Smirnov Test Statistic : 0.153
p-value                           : 0.200

[Figure: Fitted Distribution]


From the p-value for the chi-square test, 0.007, we can conclude that the fit is not good.
Here one should note that the p-value for the Kolmogorov-Smirnov test is high since
we are estimating the parameters.
Example 5
Fitting Normal Distribution


The file FOREARM1 contains the length of the forearm (in inches) from Pearson
and Lee (1903). A normal distribution may be an appropriate model to describe the
data on the forearm length.


The input is: 


FITDIST 
USE FOREARM1 
CONTINUOUS ARMLENGTH 


The output is: 


Variable Name : ARMLENGTH
Distribution  : Normal

Estimated Parameter(s)
Location or Mean(mu)  : 18.802
Scale or SD(sigma)    : 1.116

Estimation of Parameter(s): Maximum Likelihood Method


Test Results

Lower Limit   Upper Limit   Observed   Expected
      .          17.160         11        9.893
   17.160        17.690         12       12.450
   17.690        18.220         16       19.802
   18.220        18.750         29       25.247
   18.750        19.280         22       25.802
   19.280        19.810         24       21.138
   19.810        20.340         11       13.881
   20.340           .           15       11.786

Chi-square Test Statistic : 3.850
Degrees of Freedom        : 5
p-value                   : 0.571

Kolmogorov-Smirnov Test Statistic : 0.048
p-value                           : 0.554

Shapiro-Wilk Test Statistic : 0.992
p-value                     : 0.590

[Figure: Fitted Distribution. Observed relative frequencies with the fitted normal density, plotted against ARMLENGTH.]


The above analysis indicates that a normal distribution fits the data well. 


Example 6 
Fitting Weibull Distribution 


The file SOFTWARE1 contains failure time (in CPU seconds, measured in terms of
execution time) of a real time command and control software system. A Weibull model
is fitted to the inter-failure time (INTER).


The input is: 


FITDIST 
USE SOFTWARE1 
CONTINUOUS INTER /DIST=W(520, 0.7) 


The output is: 

Variable Name : INTER
Distribution  : Weibull

Specified Parameter(s)
Scale : 520.000
Shape : 0.700


Test Results

Lower Limit   Upper Limit   Observed   Expected
      .         616.800         88
  616.800      1231.600         28
 1231.600      2461.200         10
 2461.200          .             7        6.831
                               133      133.000

Chi-square Test Statistic : 3.267
Degrees of Freedom        : 3
p-value                   : 0.352

Kolmogorov-Smirnov Test Statistic : 0.038
p-value                           : 0.991


[Figure: Fitted Distribution]


For graphical analysis, type:


PPLOT INTER /EXP
PPLOT INTER /WEI=520, 0.7



good. 
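
Most of the Expected column in the Example 6 listing is not legible, but it can be approximated from the specified Weibull(520, 0.7) distribution. The sketch below assumes the distribution function F(t) = 1 - exp(-(t / sc)^shp) with sc = 520 and shp = 0.7, and takes the last observed count as 7, inferred from the total of 133. Under those assumptions it reproduces the one expected value that survives (6.831) and the printed chi-square statistic.

```python
import math

n = 133
scale, shape = 520.0, 0.7                 # W(520, 0.7) as specified in the input
cuts = [616.8, 1231.6, 2461.2]            # class boundaries from the output table
observed = [88, 28, 10, 7]                # last count inferred from the total of 133


def weibull_cdf(t, sc, shp):
    """Distribution function of a two-parameter Weibull(sc, shp)."""
    if t <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((t / sc) ** shp))


# The first class starts at 0 and the last class is open-ended.
edges = [0.0] + cuts + [math.inf]
expected = [
    n * (weibull_cdf(hi, scale, shape) - weibull_cdf(lo, scale, shape))
    for lo, hi in zip(edges[:-1], edges[1:])
]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(e, 3) for e in expected])
print(f"chi-square = {statistic:.3f}")
```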
Example 7
Fitting Multiple Distributions


For this purpose, we first generate data from the triangular (minimum=0, maximum=1,
mode=0.4) distribution and try to fit triangular, beta, and logit normal distributions.


The input is: 


RA VARIATE TRRN (0, 1, 0.4) /SIZE=100 RSEED=74124 
FITDIST 
SAVE FREQUENCIES 


CONTINUOUS S1 /DIST=TR, B, EN

The output is:


Variable Name : S1
Distribution  : Triangular

Estimated Parameter(s)
Low(a)  : 0.024
High(b) : 0.974
Mode(c) : 0.463

Estimation of Parameter(s): Modified Maximum Likelihood and Moments


Test Results

Lower Limit   Upper Limit   Observed   Expected
     .           0.220           6        9.203
   0.220         0.313          13       10.805
   0.313         0.406          21       14.949
   0.406         0.499          16       18.528
   0.499         0.592          16       16.431
   0.592         0.685          12       12.865
   0.685         0.778           6        9.299
   0.778           .            10        7.920
                               100      100.000

Chi-square Test Statistic : 6.141
Degrees of Freedom        : 4
p-value                   : 0.189

*** WARNING *** : One or more parameters of distribution are estimated.
Significance of K-S test computed on this distribution is suspect.

Kolmogorov-Smirnov Test Statistic : 0.058
p-value                           : 0.891

[Figure: Fitted Distribution]

Variable Name : S1
Distribution  : Beta

Estimated Parameter(s)
Shape1(alpha) : 2.626
Shape2(beta)  : 2.767

Estimation of Parameter(s): Method of Moments


Test Results

Lower Limit   Upper Limit   Observed   Expected
     .           0.220           6        9.743
   0.220         0.313          13       11.556
   0.313         0.406          21       14.797
   0.406         0.499          16       16.308
   0.499         0.592          16       15.912
   0.592         0.685          12       13.724
   0.685         0.778           6       10.138
   0.778           .            10        7.823

Chi-square Test Statistic : 6.736
Degrees of Freedom        : 5
p-value                   : 0.241

*** WARNING *** : One or more parameters of distribution are estimated.
Significance of K-S test computed on this distribution is suspect.

Kolmogorov-Smirnov Test Statistic : 0.055
p-value                           : 0.923

[Figure: Fitted Distribution]

Variable Name : S1
Distribution  : Logit Normal

Estimated Parameter(s)
Location(mu)  : -0.035
Scale(sigma)  :

Estimation of Parameter(s): Maximum Likelihood Method


Test Results
Logit transformation is used on data.

Lower Limit   Upper Limit
     .          -1.364
  -1.364        -0.700
  -0.700        -0.037
  -0.037         0.627
   0.627         1.291
   1.291           .

Chi-square Test Statistic : 6.223
Degrees of Freedom        : 3
p-value                   : 0.101

Kolmogorov-Smirnov Test Statistic : 0.081
p-value                           : 0.103

Shapiro-Wilk Test Statistic : 0.964
p-value                     : 0.008

You can try the following probability plots as further checks.


PPLOT S1 /TRIANGULAR=0, 1, 0.4
PPLOT S1 /BETA=3.367717, 3.430665
PPLOT S1 /ENORMAL=-0.011409, 0.849718


[Figures: triangular, beta, and logit normal probability plots of S1]

Computation


Algorithms


When closed forms of maximum likelihood estimates of parameters are not available
for certain distributions, the Newton-Raphson method is used to solve the likelihood
equations.
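
To illustrate the idea, the sketch below applies Newton-Raphson to the likelihood equation for the shape parameter of a two-parameter Weibull distribution, which has no closed-form solution. It is not SYSTAT's implementation: the function names, the starting value, the step-halving safeguard, and the sample data are all assumptions made for this example.

```python
import math


def shape_equation(data, shape):
    """Weibull likelihood equation g(shape) and its derivative g'(shape)."""
    logs = [math.log(x) for x in data]
    powers = [x ** shape for x in data]
    s0 = sum(powers)
    s1 = sum(p * l for p, l in zip(powers, logs))
    s2 = sum(p * l * l for p, l in zip(powers, logs))
    value = s1 / s0 - 1.0 / shape - sum(logs) / len(data)
    slope = (s2 * s0 - s1 * s1) / (s0 * s0) + 1.0 / shape ** 2
    return value, slope


def weibull_mle(data, start=1.0, tol=1e-10, max_iter=100):
    """Solve g(shape) = 0 by Newton-Raphson, then get the scale in closed form."""
    shape = start
    for _ in range(max_iter):
        value, slope = shape_equation(data, shape)
        step = value / slope
        while shape - step <= 0.0:   # keep the iterate inside the parameter space
            step /= 2.0
        shape -= step
        if abs(step) < tol:
            break
    scale = (sum(x ** shape for x in data) / len(data)) ** (1.0 / shape)
    return shape, scale


# Made-up positive failure times, used only to exercise the iteration.
times = [12.0, 35.0, 48.0, 77.0, 101.0, 150.0, 220.0]
shape_hat, scale_hat = weibull_mle(times)
print(f"shape = {shape_hat:.4f}, scale = {scale_hat:.4f}")
```

Each iteration replaces the current shape by shape - g(shape) / g'(shape); once the shape has converged, the scale estimate follows directly from the data.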

Chapter 14
Hypothesis Testing 


Tirthankar Chakraborty, A.Naresh Raj, and S. Anoopama 
(Revised version of “T Tests” by Laszlo Engelman in 
previous SYSTAT versions) 


The Hypothesis Testing feature provides several parametric tests of hypotheses and 
confidence intervals for means, variances, proportions, and correlations. You can 
perform a one-sample z-test and compute a confidence interval for the mean when 
you have a sample from a normal distribution with known standard deviation, and a 
t-test when the standard deviation is unknown. Similarly, you can perform a two- 
sample z-test and t-test, and compute confidence intervals for the difference between 
two population means. You can perform a paired t-test for equality of two means when
the observations are paired (and hence correlated). You can also perform a test for the
mean of a Poisson distribution. When multiple tests are performed, you can make two
adjustments to the Type I error probabilities, namely Bonferroni and Dunn-Sidak
adjustments.

Techniques like regression and ANOVA require the assumption of equality of
variances over groups. SYSTAT offers tests for a single variance, equality of two
variances, and the Bartlett's and Levene's tests for equality of several variances.

When you have a sample from a bivariate normal distribution, you can test for zero
correlation and a specified value of correlation. You can test for equality of
correlation coefficients. In addition, you can test for proportions: single proportion
and equality of two proportions.

Graphs are produced when you perform tests for means and variances: an overlay of a
box plot, a dot density, and a density plot in a single frame. Scatter
plots are produced when tests are performed for correlation coefficients.

Resampling procedures are available in this feature.

Statistical Background


Hypothesis testing procedures form an essential tool in any scientific investigation. A
scientific investigation usually begins with the formulation of suitable hypotheses. To
test the validity of such hypotheses, data are collected through experiments or surveys.
Several procedures are available to test different types of hypotheses. A
comprehensive and elaborate discussion of various aspects of the hypothesis testing
problem is available in Statistics IV, “Power Analysis” on page 19. For additional
information, see Snedecor and Cochran (1989).

The Hypothesis Testing feature in SYSTAT offers one-sample, two-sample, and
multi-sample tests for testing a null hypothesis against an alternative hypothesis of the
following types: ‘not equal’, ‘less than’, and ‘greater than’, and provides confidence
intervals.


One-Sample Tests and Confidence Intervals for Mean and Proportion 


z-test. When you have a sample of observations from a normal distribution with
known standard deviation, or the sample size is sufficiently large, you can perform
the z-test for comparing the mean to a specified value. In addition, you can compute a
confidence interval of a specified confidence coefficient for the mean.


t-test. When the data come from a normal distribution with unknown standard
deviation, you can perform the t-test, and obtain the corresponding confidence interval
for the mean.


Proportion test. You can perform the binomial test for proportions, and compute a
confidence interval for a single proportion.
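
The one-sample z-test described above can be sketched in a few lines. This is an illustration and not SYSTAT's code: the function name, its arguments, and the sample data are assumptions, and the alternative labels simply mirror the three options named in this chapter.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z_test(data, mu0, sigma, alternative="not equal", confidence=0.95):
    """z statistic, p-value, and confidence interval for a mean with known sigma."""
    n = len(data)
    mean = fmean(data)
    standard_error = sigma / math.sqrt(n)
    z = (mean - mu0) / standard_error
    std_normal = NormalDist()
    if alternative == "greater than":
        p_value = 1.0 - std_normal.cdf(z)
    elif alternative == "less than":
        p_value = std_normal.cdf(z)
    elif alternative == "not equal":
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater than', 'less than' or 'not equal'")
    critical = std_normal.inv_cdf(0.5 + confidence / 2.0)
    interval = (mean - critical * standard_error, mean + critical * standard_error)
    return z, p_value, interval


# Made-up measurements with an assumed known standard deviation of 0.2.
sample = [5.1, 4.9, 5.3, 5.2, 5.0]
z, p_value, interval = one_sample_z_test(sample, mu0=5.0, sigma=0.2)
print(f"z = {z:.3f}, p-value = {p_value:.3f}, "
      f"95% CI = ({interval[0]:.3f}, {interval[1]:.3f})")
```

When the standard deviation is unknown, the same structure applies with the sample standard deviation and a t critical value in place of the normal one.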


Two-Sample Tests and Confidence Intervals for Means and Proportions


z-test. You can perform the two-sample z-test for equality of two means when the
population distributions for both the variables under study follow normal distributions
with a known common standard deviation, or if the sample sizes are sufficiently
large. A confidence interval for the difference between the means can be computed.

Two-sample t-test. You can perform the two-sample t-test for equality of two
population means and compute a confidence interval for the difference between the
means when the data come from two independent normal populations with the same
but unknown standard deviation. When the standard deviations are not the same, a
modified form of the t-test is required. SYSTAT performs both the tests and the
corresponding outputs are provided.
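
The difference between the two forms of the test can be seen in a short sketch. It assumes that the "modified form" is the usual separate-variance (Welch) statistic with Satterthwaite degrees of freedom, which this chapter does not state explicitly; the function name and data are illustrative.

```python
import math
from statistics import fmean, variance


def two_sample_t(x, y):
    """Pooled-variance and separate-variance t statistics with their df."""
    n1, n2 = len(x), len(y)
    m1, m2 = fmean(x), fmean(y)
    v1, v2 = variance(x), variance(y)   # sample variances (n - 1 denominator)

    # Pooled form: assumes both populations share one standard deviation.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = (m1 - m2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    df_pooled = n1 + n2 - 2

    # Separate-variance form: each group keeps its own variance estimate.
    se_squared = v1 / n1 + v2 / n2
    t_separate = (m1 - m2) / math.sqrt(se_squared)
    df_separate = se_squared ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return (t_pooled, df_pooled), (t_separate, df_separate)


# Made-up groups with clearly different spreads.
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]
pooled, separate = two_sample_t(group1, group2)
print(f"pooled:   t = {pooled[0]:.3f}, df = {pooled[1]}")
print(f"separate: t = {separate[0]:.3f}, df = {separate[1]:.3f}")
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics differ as well.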


Paired t-test. The paired t-test assesses the equality of two means in experiments
involving paired measurements. In practice, using the paired t-test, you can compute the
differences between values of two variables and test whether the average of the
differences in the populations differs from zero.


Equality of proportions test. You can perform this test based on a normal
approximation for testing equality of two proportions when dealing with two
independent groups whose members can be classified into one of the two categories of
a binary response variable. A confidence interval for the difference between the
proportions can also be obtained.


Tests for Variances and Confidence Intervals 


Single variance. You can perform the chi-square test for the variance of a normal
distribution, and obtain a confidence interval for the variance.


Equality of two variances. A two-sample t-test assumes that the data come from two
independent normal distributions with equal variances. Before performing the test, it
is worthwhile to verify this assumption. For this purpose or otherwise, you can perform
a test for equality of two variances using the F-test. In addition, you can obtain a
confidence interval for the ratio of population variances.


Bartlett's and Levene's tests. These tests are used for testing the equality of several
variances. Equality of variances is one of the crucial assumptions of
procedures like ANOVA and regression analysis. You can perform the Bartlett's and
Levene's tests for checking the validity of the assumption or otherwise. The Bartlett's
test is sensitive to departures from normality, while the Levene's test is comparatively
robust. Therefore, when the data distribution deviates from normal, the Levene's test
is a better choice.

Tests for Correlations and Confidence Intervals


Zero correlation coefficient. You can perform the t-test to determine whether a
correlation coefficient is zero. The test is suitable when the sample is derived from a
bivariate normal distribution. A confidence interval for the correlation coefficient is
provided.


Specific correlation coefficient. You can perform the Fisher’s test to determine
whether the correlation coefficient between two variables equals a specified value. A
confidence interval for the correlation coefficient can also be obtained.


Equality of two correlation coefficients. You can perform the test based on the 
Fisher’s z-transformation for equality of two correlation coefficients. 
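
Both the specific-correlation test and the equality test rest on Fisher's z-transformation, z = atanh(r), whose sampling distribution is approximately normal with standard error 1 / sqrt(n - 3). The sketch below shows one common way to carry out the two tests; the function names and the example values are assumptions, not SYSTAT's interface.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, n, rho0=0.0, confidence=0.95):
    """Two-sided test of H0: rho = rho0, with a confidence interval for rho."""
    standard_error = 1.0 / math.sqrt(n - 3)
    z = (math.atanh(r) - math.atanh(rho0)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Build the interval on the transformed scale, then map back with tanh.
    interval = (
        math.tanh(math.atanh(r) - critical * standard_error),
        math.tanh(math.atanh(r) + critical * standard_error),
    )
    return z, p_value, interval


def equal_correlations_test(r1, n1, r2, n2):
    """Two-sided test of H0: rho1 = rho2 for two independent samples."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up example: r = 0.5 observed in a sample of 28 pairs.
z, p_value, interval = fisher_z_test(0.5, 28)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, "
      f"95% CI = ({interval[0]:.3f}, {interval[1]:.3f})")
```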


Multiple Tests 


SYSTAT allows the users to request tests for 
while testing for mean(s). The p-value associated with the test assumes that the user is 


offers two adjustments to the Type I error probabilities:


■ Bonferroni. The Bonferroni adjustment is appropriate when more than one test is
performed simultaneously. It drops the second and higher order terms from the
binomial expansion of the expression $1 - (1 - p)^n$. The Bonferroni adjusted
probability is $np$; it is valid only for small $p$.

■ Dunn-Sidak. The Dunn-Sidak adjustment is appropriate when more than one test
is performed simultaneously. For $n$ tests, the Dunn-Sidak adjusted probability is
$1 - (1 - p)^n$.
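
The two adjustments are simple enough to state as code. The sketch below uses illustrative function names; capping the Bonferroni value at 1 is an addition made here so that the result remains a probability, not something stated in the text above.

```python
def bonferroni(p, n_tests):
    """Bonferroni adjusted probability n * p, capped at 1 (the cap is an addition)."""
    return min(1.0, n_tests * p)


def dunn_sidak(p, n_tests):
    """Dunn-Sidak adjusted probability 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n_tests


# Made-up example: five simultaneous tests.
for p in (0.001, 0.01, 0.05):
    print(f"p = {p}: Bonferroni = {bonferroni(p, 5):.5f}, "
          f"Dunn-Sidak = {dunn_sidak(p, 5):.5f}")
```

For small p the two values nearly coincide, because the Bonferroni value is the first-order term of the Dunn-Sidak expression; the Dunn-Sidak value is never larger than the Bonferroni value.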

Hypothesis Testing in SYSTAT
Tests for Mean(s) 


One-Sample z-Test Dialog Box 


To open the One-Sample z-Test dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Mean 
One-Sample z-Test... 


The following should be specified to perform the one-sample z-test:


Selected variable(s). Variables selected for which one-sample z-tests are desired. Each
variable corresponds to a separate one-sample z-test. For multiple tests, use the
optional p-value adjustments (Bonferroni and Dunn-Sidak) to control a Type I error
probability.


Mean. Enter a constant value to be compared with the sample mean, for each of the
selected variables.


SD. Enter the population (known) standard deviation. 


Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.


Confidence. Specify a confidence level for the confidence interval for the mean.


Two-Sample z-Test Dialog Box 


To perform the two-sample z-test, data should be in the form of a grouping variable in
one column and the variable for testing in another. You may use Reshape under the Data
menu if required, for obtaining this format.

To open the Two-Sample z-Test dialog box, from the menus choose:


Analyze
Hypothesis Testing
Mean
Two-Sample z-Test...


The following should be specified to perform the two-sample z-test:


Selected variable(s). Variables selected for which two-sample z-tests are desired. Each
variable corresponds to a separate two-sample z-test. For multiple tests, use the
optional p-value adjustments (Bonferroni and Dunn-Sidak) to control a Type I error
probability.

Grouping variable. Select the grouping variable.

SD 1. Enter the (known) standard deviation for group 1.

SD 2. Enter the (known) standard deviation for group 2.

Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.

Confidence. Specify a confidence level for the confidence interval for the difference
in the means.


One-Sample t-Test Dialog Box 


To open the One-Sample t-Test dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Mean 


One-Sample t-Test... 


The following should be specified to perform the one-sample t-test:


Selected variable(s). Variables selected for which one-sample t-tests are desired. Each
variable corresponds to a separate one-sample t-test. For multiple tests, use the optional
p-value adjustments (Bonferroni and Dunn-Sidak) to control a Type I error probability.


Mean. Enter a constant value to be compared with the sample mean, for each of the
selected variables.

Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.


Confidence. Specify a confidence level for the confidence interval for the mean.


Paired t-Test Dialog Box 


To perform the paired t-test, the data should be in two columns, data on each case 
appearing as a pair in two columns. 
To open the Paired t-Test dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Mean 
Paired t-Test... 


The following should be specified to perform the paired t-test:


Selected variable(s). Variables selected for which paired t-tests are desired. If more than
two variables are selected, each variable pair results in a separate paired t-test. For
multiple tests, use the optional p-value adjustments (Bonferroni and Dunn-Sidak) to
control a Type I error probability.


Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.


Confidence. Specify a confidence level for the confidence interval for the difference in
means.


Two-Sample t-Test Dialog Box 


To perform the two-sample t-test, data should be in the form of a grouping variable in
one column and the variable for testing in another. You may use Reshape under the Data
menu if required, for obtaining this format.


To open the Two-Sample t-Test dialog box, from the menus choose:


Analyze
Hypothesis Testing
Mean
Two-Sample t-Test...


The following should be specified to perform the two-sample t-test:


Selected variable(s). Variables selected for which two-sample t-tests are desired. Each
variable corresponds to a separate two-sample t-test. For multiple tests, use the
optional p-value adjustments (Bonferroni and Dunn-Sidak) to control a Type I error
probability.


Grouping variable. Select the grouping variable.


Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.

Confidence. Specify a confidence level for the confidence interval for the difference in
the means.

Poisson Test Dialog Box


To open the Poisson Test dialog box, from the menus choose:


Analyze
Hypothesis Testing
Mean
Poisson Test...


The following should be specified to perform the Poisson test:


Selected variable(s). Variables selected for which Poisson tests are desired. Each
variable corresponds to a separate Poisson test. For multiple tests, use the optional
p-value adjustments (Bonferroni and Dunn-Sidak) to control a Type I error probability.

Mean. Enter a constant value to be compared with the sample mean, for each of the
selected variables.

Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.

Confidence. Specify a confidence level for the confidence interval for the square root
of Poisson mean.


Tests for Variance(s) 


Single Variance Dialog Box 


To open the Single Variance dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Variance 
Single Variance... 


The following should be specified to perform the test for single variance:


Selected variable(s). Variables selected for which single variance tests are desired.
Each variable corresponds to a separate single variance test.


Variance. Enter a constant value to be compared with the sample variance, for each of
the selected variables.


Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.


Confidence. Specify a confidence level for the confidence interval for the variance.


Equality of Two Variances Dialog Box 


Analyze 
Hypothesis Testing 
Variance 
Equality of Two Variances...


The following should be specified to perform the test for equality of two variances:


Selected variable(s). Variables selected for which the equality of two variances tests are
desired.

Grouping variable. Select the grouping variable.


Alternative type. Choose the form of the alternative hypothesis. Three options are
available:

■ greater than
■ less than
■ not equal

The default option is ‘not equal’.


Confidence. Specify a confidence level for the confidence interval for the ratio of the
two population variances.




Equality of Several Variances Dialog Box 


To perform the Bartlett's and Levene's tests for equality of several variances, the data
should be in the form of a grouping variable in one column and the variable for testing
in another. You may use Reshape under the Data menu if required, for obtaining this
format.


To open the Equality of Several Variances dialog box, from the menus choose:


Analyze
Hypothesis Testing
Variance
Equality of Several Variances...


[Dialog box: Equality of Several Variances]

The following should be specified to perform the test for equality of several variances:

Selected variable(s). Variables selected for which equality of several variances tests are 
desired.

Grouping variable. Select the grouping variable.


Tests for Correlation(s)


Zero Correlation Dialog Box 


To open the Zero Correlation dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Correlation 
Zero Correlation... 


[Dialog box: Correlation: Zero Correlation]

The following should be specified to perform the test for zero correlation:

Selected variable(s). Variables selected for zero correlation test. Each pair of variables 
corresponds to a separate zero correlation test.

Alternative type. Choose the form of the alternative hypothesis. Three options are 
available:

■ greater than
■ less than
■ not equal

The default option is 'not equal'.

Confidence. Specify a confidence level for the confidence interval for the correlation 
coefficient.


Specific Correlation Dialog Box 


To open the Specific Correlation dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Correlation 
Specific Correlation...

[Dialog box: Correlation: Specific Correlation]

The following should be specified to perform the test for specific correlation:

Selected variable(s). Variables selected for which specific correlation tests are desired. 
Each pair of variables corresponds to a separate specific correlation test.

Correlation. Enter a constant value to be compared with the sample correlation, for 
each of the selected variables.

Alternative type. Choose the form of the alternative hypothesis. Three options are 
available:

■ greater than
■ less than
■ not equal

The default option is 'not equal'.

Confidence. Specify a confidence level for the confidence interval for the correlation 
coefficient.


Equality of Two Correlations Dialog Box 


To open the Equality of Two Correlations dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Correlation 
Equality of Two Correlations... 


[Dialog box: Equality of Two Correlations]

The following should be specified to perform the test for equality of two correlations:

Set 1. Select the variables belonging to the first population.

Set 2. Select the variables belonging to the second population.

Alternative type. Choose the form of the alternative hypothesis. Three options are 
available:

■ greater than
■ less than
■ not equal

The default option is 'not equal'.


Tests for Proportion(s)


Single Proportion Dialog Box 


To open the Single Proportion dialog box, from the menus choose: 


Analyze 
Hypothesis Testing 
Proportion 
Single Proportion...

[Dialog box: Proportion: Single Proportion]

The following should be specified to perform the test for single proportion:

Raw data. Select this option if the data set for testing is in a data file.

■ Trials. Select the variable that represents the number of trials.
■ Successes. Select the variable that represents the number of successes.

Aggregate. Select this option if the data for testing are being keyed in:

■ Number of trials. Enter the number of trials.
■ Number of successes. Enter the total number of successes.

Proportion. Enter the constant value to which you want to compare the sample 
proportion.

Alternative type. Choose the form of the alternative hypothesis. Three options are 
available:

■ greater than
■ less than
■ not equal

The default option is 'not equal'.

Confidence. Specify a confidence level for the confidence interval for the proportion.

Equality of Two Proportions Dialog Box 


To open the Equality of Two Proportions dialog box, from the menus choose: 
Analyze 
Hypothesis Testing 
Proportion 
Equality of Two Proportions... 


■ Trials 1. Select the variable that represents the number of trials (sample size) in the 
■ Successes 2. Select the variable that represents the number of successes in Trials 2.

Aggregate. Select this option if the data set is being keyed in:

■ Sample 1. Enter the following for the first population:
   ■ Number of trials. Enter the number of trials.
   ■ Number of successes. Enter the number of successes.
■ Sample 2. Enter the following for the second population:
   ■ Number of trials. Enter the number of trials.
   ■ Number of successes. Enter the number of successes.

Alternative type. Choose the form of the alternative hypothesis. Three options are 
available:

■ greater than
■ less than
■ not equal

The default option is 'not equal'.

Confidence. Specify a confidence level for the confidence interval for the difference in 
the proportions.

Using Commands 


For testing of means

To perform a one-sample z-test, after selecting your data file with USE filename, 
continue with:

TESTING
ZTEST varlist = CONSTANT / SD=s BONF DUNN CONFI=n ALT=option

To perform a two-sample z-test:

TESTING
ZTEST varlist * grpvar / SD1=s1 SD2=s2 BONF DUNN CONFI=n,
      ALT=option

To perform a one-sample t-test:

TESTING
TTEST varlist = CONSTANT / BONF DUNN CONFI=n ALT=option

To perform a paired t-test:

TESTING
TTEST varlist / BONF DUNN CONFI=n ALT=option

To perform a two-sample t-test:

TESTING
TTEST varlist * grpvar / BONF DUNN CONFI=n ALT=option

To perform a Poisson test:

TESTING
POISSON varlist = CONSTANT / BONF DUNN CONFI=n ALT=option

For testing of variances 


To perform a test for single variance, after selecting your data file with USE filename, 
continue with: 


TESTING 
VARI varlist = CONSTANT / CONFI=n ALT=option

For testing equality of two variances:

TESTING
VARI varlist * grpvar / CONFI=n ALT=option

If grpvar has more than two distinct categories, SYSTAT performs Bartlett’s and 
Levene's tests.


For testing of correlation coefficients 


TESTING 


TCORR varlist / CONFI=n ALT=option 
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For testing specific correlation coefficient: 


TESTING 
TCORR varlist = CONSTANT / CONFI=n ALT=option 


For testing equality of two correlation coefficients: 


TESTING 
TCORR var1 var2 = var3 var4 / ALT=option


For testing of proportions 


For testing single proportion: 


TESTING 


PROP trialvar successvar = P / CONFI=n ALT=option

In the case of aggregated data of N trials with X successes:

TESTING
PROP N X = P / CONFI=n ALT=option

For testing equality of two proportions:

TESTING
PROP trialvar1 successvar1 = trialvar2 successvar2 / CONFI=n,
     ALT=option

For aggregated data:

TESTING
PROP N1 X1 = N2 X2 / CONFI=n ALT=option

TESTING allows you to use resampling commands: SAMPLE = BOOT(m,n) / 
SIMPLE(m,n) / JACK.

Note: ALT may assume one of the following depending on the requirement of your 
problem: NE, GT, and LT. The default is NE.


Usage Considerations 


Types of data. Data for Hypothesis Testing must be in rectangular format and the test 
variables must be numeric. The grouping variable for the two-sample z-test, two-sample 
t-test, equality of two variances and equality of several variances can contain 
either numbers or characters.


Print options. The output is standard for all PLENGTH options. 


Quick Graphs. The Quick Graph produced depends on the test. The tests for means 
and variances produce Quick Graphs combining three graphical displays: a box plot 
displaying the sample median, quartiles, and outliers (if any), a normal curve 
calculated using the sample mean and standard deviation, and a dot plot displaying 
each observation. A paired t-test produces a Quick Graph in which, for each case pair, 
a line connects the values on the two variables. For testing of correlation coefficients, 
the Quick Graph consists of a scatter plot. Quick Graphs are not available for testing 
several variances and for testing of proportions.

Saving files. TESTING does not save the results of the analysis.

BY groups. TESTING analyzes data by groups.

Case frequencies. TESTING uses the FREQUENCY variable, if present, to duplicate 
cases.


Case weights. TESTING uses the WEIGHT variable, if present, to weight cases for the 
features: one-sample t-test, paired t-test, and two-sample t-test. 


Examples


Example 1 
One-Sample z-Test 


To illustrate the use of the one-sample z- 
(mg/kg dry weight) at 37 stations in Kenya, i i 


The input is:

TESTING
USE LEAD
ZTEST LEAD=30 / SD=37.1239 CONF=0.95 ALT=GT

The output is:

One-sample z-Test of LEAD with 37 Cases

Ho: Mean = 30.00 vs Alternative = 'greater than'

Mean                       : 37.243
95.00% Confidence Bound    : 27.204
Assumed Standard Deviation : 37.124
z                          : 1.187
p-value                    : 0.118

[Quick Graph: One-sample z-test]

The above graph shows that there may not be a significant difference in the lead 
concentration in the inshore waters from the assumed mean of 30 mg/kg dry weight. 
The p-value = 0.118 confirms this observation.
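
The figures in this output can be cross-checked by hand. The following is a minimal Python sketch, not SYSTAT code: the function name `one_sample_z` is hypothetical, and the inputs are the rounded sample mean, sample size, and assumed standard deviation printed above.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(mean, mu0, sd, n, conf=0.95):
    """One-sample z-test with a known standard deviation, alternative 'greater than'."""
    se = sd / sqrt(n)                       # standard error of the mean
    z = (mean - mu0) / se                   # test statistic
    p_greater = 1.0 - NormalDist().cdf(z)   # upper-tail p-value
    lower_bound = mean - NormalDist().inv_cdf(conf) * se  # one-sided lower bound
    return z, p_greater, lower_bound


# Values taken from the Example 1 output (rounded as printed there).
z, p_value, lower_bound = one_sample_z(mean=37.243, mu0=30.0, sd=37.1239, n=37)
print(f"z = {z:.3f}, p-value = {p_value:.3f}, lower bound = {lower_bound:.3f}")
```

Because the p-value is well above 0.05, the sketch leads to the same conclusion as the output: the null hypothesis of a mean of 30 is not rejected.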


Example 2 
Two-Sample z-Test 
Is the life expectancy of males the same among the developed and emerging countries? 
To investigate this, we use the OURWORLD data file. Here, we assume that the standard 
deviations are known, and so we can perform the two-sample z-test. For the 
countries in the OURWORLD data file, the life expectancy of males (LIFEEXPM) is 
recorded for the developed and emerging countries.
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The input is: 
TESTING
USE OURWORLD
ZTEST LIFEEXPM * GDP$ / SD1=3.0 SD2=9.0 CONF=0.95 ALT=NE

The output is:

Two-sample z-Test on LIFEEXPM Grouped by GDP$ vs Alternative = 'not equal'

                            Standard
GROUP      |   N     Mean  Deviation
-----------+------------------------
Developed  |  30   70.833      3.833
Emerging   |  27   58.704      9.968

Difference in Means        : 12.130
95.00% Confidence Interval : 8.569 to 15.690
z                          : 6.677
p-value                    : 0.000

[Quick Graph: Two-sample z-test (LIFEEXPM)]
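
As a cross-check of the output above, here is a minimal Python sketch of the two-sample z-test. It is not SYSTAT code; `two_sample_z` is a hypothetical helper, and the inputs are the rounded group means and sizes from the table together with the assumed standard deviations (3.0 and 9.0) given on the ZTEST line.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, n1, sd1, mean2, n2, sd2, conf=0.95):
    """Two-sample z-test with known standard deviations, two-sided alternative."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)      # standard error of the difference
    diff = mean1 - mean2
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + conf / 2.0)  # two-sided critical value
    return diff, z, p_value, (diff - crit * se, diff + crit * se)


# Developed: n = 30, mean 70.833; Emerging: n = 27, mean 58.704.
diff, z, p_value, ci = two_sample_z(70.833, 30, 3.0, 58.704, 27, 9.0)
print(f"difference = {diff:.3f}, z = {z:.3f}, CI = {ci[0]:.3f} to {ci[1]:.3f}")
```

Note that the test uses the assumed standard deviations, not the sample standard deviations shown in the table.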


The quick graph here consists of dot density, box 
the developed and emerging nations, laid Side-b 
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Example 3 
One-Sample t-Test 


We use the data file OURWORLD to illustrate the use of the one-sample t-test to 
investigate whether Europe’s population will remain stable over the years. If the 
population is to remain stable, then the ratio of the birth rate to the death rate should 
not exceed 1.25, that is, five births for every four deaths. Should we reject the null 
hypothesis that the average European birth-to-death ratio is 1.25? 


The input is: 


TESTING
USE OURWORLD
SELECT (GROUP$ = "Europe")
TTEST B_TO_D = 1.25 / CONF=0.95 ALT=NE
SELECT

The output is:

Data for the following results were selected according to
SELECT (GROUP$ = "Europe")

One-sample t-test of B_TO_D with 20 Cases
Ho: Mean = 1.250 vs Alternative = 'not equal'

Mean                       : 1.257
95.00% Confidence Interval : 1.157 to 1.357
Standard Deviation         : 0.213
t                          : 0.147
df                         : 19
p-value                    : 0.884




[Quick Graph: One-sample t-test]


The average birth-to-death ratio for the European countries in the sample is 1.257. We 
are unable to reject the null hypothesis since the p-value is 0.884. We do not have 
sufficient evidence that Europe’s population will increase in size.


Example 4 
Paired t-Test 


essential hypertension, immediately before and two hours after administering the drug 
captopril.

The input is:

TESTING
USE BP
TTEST SYSBP_BEFORE SYSBP_AFTER / ALT=GT

The output is:

Paired Samples t-test on SYSBP_BEFORE vs SYSBP_AFTER with 15 Cases
Alternative = 'greater than'

Mean SYSBP_BEFORE                : 176.933
Mean SYSBP_AFTER                 : 158.000
Mean Difference                  : 18.933
95.00% Confidence Bound          : 14.828
Standard Deviation of Difference : 9.027
t                                : 8.123
df                               : 14
p-value                          : 0.000

[Quick Graph: Paired t-test, showing SYSBP_AFTER and SYSBP_BEFORE by Index of Case]


From the above graph, the systolic blood pressure seems to have decreased 
after administering the drug captopril. The test results (mean difference = 18.933, 
p = 0.000) indicate that the drug captopril reduces the systolic blood pressure.
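
The paired t statistic is simply the mean difference divided by its standard error. The sketch below (Python, not SYSTAT code) recomputes it from the summary values in the output; `paired_t` is a hypothetical helper, and the critical value 1.761 is an assumption taken from a standard t table (one-sided 5% point, 14 degrees of freedom), since the Python standard library has no t distribution.

```python
from math import sqrt

# One-sided 5% critical value of Student's t with 14 df, from a standard t table.
T_CRIT_95_DF14 = 1.761


def paired_t(mean_diff, sd_diff, n):
    """Return the paired t statistic and the standard error of the mean difference."""
    se = sd_diff / sqrt(n)
    return mean_diff / se, se


# Summary values from the Example 4 output.
t, se = paired_t(mean_diff=18.933, sd_diff=9.027, n=15)
lower_bound = 18.933 - T_CRIT_95_DF14 * se   # one-sided 95% lower confidence bound
print(f"t = {t:.3f}, lower bound = {lower_bound:.3f}")
```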


Example 5 
Two-Sample t-Test 


To illustrate the use of a two-sample t-test, we use the SURVEY2 data to examine whether 
the average income for males differs from that for females. The data file has 
one case for each subject, with the annual income (INCOME) and a 
character code to identify the sex (SEX$).
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The input is: 
TESTING
USE SURVEY2
TTEST INCOME*SEX$ / CONF=0.95 ALT=NE

The output is:

Two-sample t-test on INCOME Grouped by SEX$ vs Alternative = 'not equal'

                         Standard
GROUP   |   N     Mean  Deviation
--------+------------------------
Female  | 152   20.257     14.828
Male    | 104   24.971     16.418

Separate Variance
Difference in Means        : -4.715
95.00% Confidence Interval : -8.676 to -0.753
t                          : -2.346
df                         : 206.233
p-value                    : 0.020

Pooled Variance
Difference in Means        : -4.715
95.00% Confidence Interval : -8.597 to -0.832
t                          : -2.397
df                         : 254.000
p-value                    : 0.018

[Quick Graph: Two-sample t-test (horizontal axes: Count)]



The average income among males and females differs significantly. The box-and- 
whisker plot and the p-value (0.018) support this conclusion. 
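
The fractional degrees of freedom in the Separate Variance section are consistent with the Welch-Satterthwaite formula. A minimal Python sketch under that assumption follows; it is not SYSTAT code, `welch_t` is a hypothetical helper, and the inputs are the rounded group summaries from the table above.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Separate-variance (Welch) t statistic and Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1            # squared standard error, group 1
    v2 = sd2 ** 2 / n2            # squared standard error, group 2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Female: n = 152, mean 20.257, sd 14.828; Male: n = 104, mean 24.971, sd 16.418.
t, df = welch_t(20.257, 14.828, 152, 24.971, 16.418, 104)
print(f"t = {t:.3f}, df = {df:.3f}")
```

The pooled test instead uses n1 + n2 - 2 = 254 degrees of freedom, which is why the two sections report different df values.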
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Example 6 
Poisson Test 


The JUICE data contain the number of defective orange juice cans (Montgomery, 
2004). It is of interest to find out whether the mean number of defective juice cans is 6. 
The Poisson test is an appropriate test here.


The input is: 


TESTING 
USE JUICE 
POISSON DEFECTS=6/ CONF=0.95 ALT=NE 


The output is: 


Poisson Test of DEFECTS with 24 Cases
Ho: Mean = 6.00 vs Alternative = 'not equal'

Exact Test
Sample Mean : 5.458
p-value     : 0.297

Normal Approximation Test
Sample Mean                : 5.458
95.00% Confidence Interval : 2.095 to 2.495 (for Square Root of Poisson mean)
z                          : -1.517
p-value                    : 0.129

[Quick Graph: Poisson Test (vertical axis: Count)]

The p-values obtained from the tests indicate that there is no significant difference in 
the mean number of defective juice cans from the assumed population mean.


Example 7
Bonferroni and Dunn-Sidak adjustments 


How do the years of life expectancy of males and females in the developed and 
emerging nations differ? We use the OURWORLD data for 57 countries. Variables 
recorded for each case (country) include URBAN (percentage of the population living 
in urban areas), LIFEEXPF (years of life expectancy for females), LIFEEXPM ( years 
of life expectancy for males), and GDP$ (grouping variable with codes Developed and 


Emerging). 
The input is: 
TESTING 
USE OURWORLD 
FORMAT 8 E 
TTEST URBAN LIFEEXPM LIFEEXPF * GDP$ / BONF DUNN CONF=0.95 ALT=NE
FORMAT 
The output is: 
Two-sample t-test on URBAN Grouped by GDP$ vs Alternative = 'not equal'

                                     Standard
GROUP      |   N          Mean      Deviation
-----------+---------------------------------
Developed  |  29   66.10344828    16.84243117
Emerging   |  27   38.55555556    19.69446102

Separate Variance
Difference in Means         : 27.54789272
95.00% Confidence Interval  : 17.68430242 to 37.41148302
t                           : 5.60601761
df                          : 51.35318370
p-value                     : 0.00000083
Bonferroni Adjusted p-value : 0.00000248
Dunn-Sidak Adjusted p-value : 0.00000248

Pooled Variance
Difference in Means         : 27.54789272
95.00% Confidence Interval  : 17.75140375 to 37.34438169
t                           : 5.63775383
df                          : 54.00000000
p-value                     : 0.00000065
Bonferroni Adjusted p-value : 0.00000194
Dunn-Sidak Adjusted p-value : 0.00000194

[Quick Graph: Two-sample t-test (URBAN)]


Two-sample t-test on LIFEEXPM Grouped by GDP$ vs Alternative = 'not equal'

                                     Standard
GROUP      |   N          Mean      Deviation
-----------+---------------------------------
Developed  |  30   70.83333333    (illegible)
Emerging   |  27   58.70370370     9.96846881

Separate Variance
Difference in Means         : 12.12962963
95.00% Confidence Interval  : 7.97424317 to 16.28501609
t                           : 5.93974079
df                          : 32.85972435
p-value                     : 0.00000117
Bonferroni Adjusted p-value : 0.00000351
Dunn-Sidak Adjusted p-value : 0.00000351

Pooled Variance
Difference in Means         : 12.12962963
95.00% Confidence Interval  : 8.19694171 to 16.06231755
t                           : 6.18109495
df                          : 55.00000000
p-value                     : 0.00000008
Bonferroni Adjusted p-value : 0.00000025
Dunn-Sidak Adjusted p-value : 0.00000025

[Quick Graph: Two-sample t-test (LIFEEXPM)]

Two-sample t-test on LIFEEXPF Grouped by GDP$ vs Alternative = 'not equal'

                Standard
GROUP      |   Deviation
-----------+------------
Developed  |  4.47740175
Emerging   | 11.03490964

Separate Variance
Difference in Means         : 15.43333333
95.00% Confidence Interval  : 10.80686663 to 20.05980003
t                           : 6.78218991
df                          : 33.61396681
p-value                     : 0.00000009
Bonferroni Adjusted p-value : 0.00000027
Dunn-Sidak Adjusted p-value : 0.00000027

Pooled Variance
Difference in Means         : 15.43333333
95.00% Confidence Interval  : 11.04515744 to 19.82150922
t                           : 7.04827869
df                          : 55.00000000
p-value                     : 0.00000000
Bonferroni Adjusted p-value : 0.00000001
Dunn-Sidak Adjusted p-value : 0.00000001

[Quick Graph: Two-sample t-test (LIFEEXPF; horizontal axes: Count)]


On the average, 66.1% of the inhabitants of developed nations live in urban areas, 
while 38.6% of those in emerging nations live in urban areas. Note that the sample size, 
N, is 29 + 27 = 56, but there are 57 cases in the OURWORLD file (the value of URBAN 
for Belgium is missing). Compare the df for the two tests: 51.4 versus 54. Thus, 
considering graphical displays, the standard deviations, and the small difference 
between the df’s for the two tests, we are not uncomfortable reporting results for the 
pooled variance test. Significantly more people in developed nations live in urban 
areas than do people in emerging nations (t = 5.638, df = 54, p-value < 0.0005).

Simply view this output as an illustration of the mechanics of the adjustment 
features. A difference between a probability of 0.00000083 and 0.00000248 is 
negligible, considering possible problems in sampling, errors in the data, or a failure to 
meet necessary assumptions. Suppose you scan the results for 100 variables; a 
probability of 0.0006 for a separate variance t-test is not significant when multiple 
testing is considered, since the Bonferroni adjusted probability would be 0.06.
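
The two adjustments discussed above follow standard formulas: Bonferroni multiplies the p-value by the number of tests m (capped at 1), and Dunn-Sidak uses 1 - (1 - p)^m. The Python sketch below illustrates them; the function names are hypothetical and this is not SYSTAT code. It uses the rounded p-value printed in the URBAN output, so the last digit can differ from the adjusted values shown there, which were presumably computed from the unrounded p-value.

```python
def bonferroni(p, m):
    """Bonferroni adjusted p-value for m tests, capped at 1."""
    return min(1.0, p * m)


def dunn_sidak(p, m):
    """Dunn-Sidak adjusted p-value for m tests."""
    return 1.0 - (1.0 - p) ** m


# Three variables were tested in this example (URBAN, LIFEEXPM, LIFEEXPF).
p_urban = 0.00000083
print(f"{bonferroni(p_urban, 3):.8f}", f"{dunn_sidak(p_urban, 3):.8f}")

# The 100-variable scenario from the text: p = 0.0006 is no longer below 0.05.
print(bonferroni(0.0006, 100), dunn_sidak(0.0006, 100))
```

For very small p-values the two adjustments are nearly identical, which is why the output shows the same value for both.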
Focusing on female life expectancy, the standard deviation for the emerging 
nations is more than two times larger than that for the developed nations, and the df for 
the separate variance test drops to 33.6. Using the separate variance test, we conclude 
that the average life expectancy of 77.4 years in developed nations differs significantly 
from that in emerging nations (t = 6.782, df = 33.6, p-value < 0.0005).

Conclusions regarding male life expectancy are similar to those for females, 
except that, for males, life expectancy is, on the average, 70.8 years in developed 
nations and 58.7 in emerging nations. You could use a paired 
t-test to check if the sex difference is significant.


Example 8
Test for Single Variance


The data file EGYPTDM consists of four measurements of male Egyptian skulls from 
five different time periods ranging from 4000 B.C. to 150 A.D. Thirty skulls are 
measured from each time period. The measurements taken are: 


MB: Maximal Breadth of Skull 
BH: Basibregmatic Height of Skull 
BL: Basialveolar Length of Skull 
NH: Nasal Height of Skull 


It is of interest to see if the variance of the maximal breadth of skull is equal to 25.


The input is: 


TESTING 
USE EGYPTDM 
VARI MB = 25 / ALT=NE 


The output is:

One-sample Variance Test of MB with 150 Cases
Ho: Variance = 25.00 vs Alternative = 'not equal'

Mean                       : 133.973
Variance                   : 23.919
95.00% Confidence Interval : 19.297 to 30.435
Chi-square                 : 142.556
df                         : 149
p-value                    : 0.734

[Quick Graph: Single Variance (MB)]

The hypothesized variance of 25 falls well within the 95% confidence interval, that is, 
19.297 to 30.435. Since the p-value is 0.734, we conclude that the variance of the 
maximal breadth of skull does not differ significantly from 25.
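
The chi-square statistic for a single variance is (n - 1)s²/σ₀². The Python sketch below recomputes it from the output values. It is not SYSTAT code: `single_variance_test` is a hypothetical helper, the two-sided p-value is taken as twice the smaller tail (an assumption about the convention used), and the chi-square distribution is approximated with the Wilson-Hilferty normal approximation because the standard library has no chi-square CDF.

```python
from math import sqrt
from statistics import NormalDist


def chi_square_cdf_wh(x, df):
    """Wilson-Hilferty normal approximation to the chi-square CDF (good for large df)."""
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / sqrt(2.0 / (9.0 * df))
    return NormalDist().cdf(z)


def single_variance_test(sample_var, sigma0_sq, n):
    """Chi-square test of Ho: variance = sigma0_sq against a two-sided alternative."""
    df = n - 1
    chi2 = df * sample_var / sigma0_sq
    lower_tail = chi_square_cdf_wh(chi2, df)
    p_two_sided = 2.0 * min(lower_tail, 1.0 - lower_tail)
    return chi2, df, p_two_sided


# Sample variance 23.919 from 150 cases, hypothesized variance 25.
chi2, df, p_value = single_variance_test(23.919, 25.0, 150)
print(f"chi-square = {chi2:.3f}, df = {df}, approximate p-value = {p_value:.3f}")
```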


Example 9 
Test for Equality of Two Variances

Using EGYPTDM data, we shall investigate if there is any difference in the 
variation in the nasal height of skull between 4000 B.C. and 3300 B.C.

The input is:

TESTING
USE EGYPTDM
SELECT (YEAR=-4000) OR (YEAR=-3300)
VARI NH*YEAR

The output is:

Data for the following results were selected according to
SELECT (YEAR=-4000) OR (YEAR=-3300)

Two-sample Variance Test of NH Grouped by YEAR vs Alternative = 'not equal'

GROUP  |   N     Mean  Variance
-------+-----------------------
-4000  |  30   50.533     7.637
-3300  |  30   50.233     8.737

95.00% Confidence Interval : 0.416 to 1.836
F-ratio                    : 0.874
df                         : 29, 29
p-value                    : 0.720

[Quick Graph: Equality of Two Variances]

From the density plot, we observe that the variation in nasal height of the skulls did not 
change from 4000 B.C. to 3300 B.C. This observation is further confirmed by the 
p-value (0.720).
Example 10 


Test for Equality of Several Variances


So the interest is in comparing the 
variability in power. The data are the coded deviations from target power using 
monomers from three different suppliers. The data are in the POWER data file.
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The input is: 


TESTING
USE POWER
VARI POWER*SUPPLIER

The output is:

Equality of Several Variances Test of POWER Grouped by SUPPLIER

GROUP |  N     Mean  Variance   Median
------+-------------------------------
1     |  9  189.144     8.693  189.100
2     |  9  174.000     6.885  174.100
3     |  9  203.944    80.215  204.400

Bartlett's Test
Chi-square : 14.507
p-value    : 0.001

Levene's Test
F-ratio : 4.545
p-value : 0.021

From the output, there seems to be a lot of variation in the power of the soft contact 
lens for the monomers supplied by supplier 3. The p-value (0.021) suggests that the 
data do not support the hypothesis of equality of variances.
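
Bartlett's statistic can be recomputed from the group variances alone. The Python sketch below implements the textbook formula; it is not SYSTAT code, `bartlett` is a hypothetical helper, and the group size of 9 per supplier is an assumption read from the output table. With three groups the statistic has 2 degrees of freedom, for which the chi-square upper tail has the closed form exp(-x/2).

```python
from math import exp, log


def bartlett(variances, sizes):
    """Bartlett's chi-square statistic for equality of several variances."""
    k = len(variances)
    n_total = sum(sizes)
    # Pooled variance, weighted by each group's degrees of freedom.
    pooled = sum((n - 1) * v for v, n in zip(variances, sizes)) / (n_total - k)
    numerator = (n_total - k) * log(pooled) - sum(
        (n - 1) * log(v) for v, n in zip(variances, sizes)
    )
    correction = 1.0 + (sum(1.0 / (n - 1) for n in sizes) - 1.0 / (n_total - k)) / (
        3.0 * (k - 1)
    )
    return numerator / correction, k - 1


# Group variances from the output; 9 observations per supplier (assumed).
stat, df = bartlett([8.693, 6.885, 80.215], [9, 9, 9])
p_value = exp(-stat / 2.0)   # exact chi-square upper tail, valid only for df == 2
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```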


Example 11 
Test for Zero Correlation Coefficient 


For the data in EGYPTDM, we investigate whether the maximal breadth and nasal 
height of skull are correlated or not in the year 4000 B.C. For this, we perform the test 
for zero correlation.


The input is: 


TESTING 
USE EGYPTDM 
SELECT (YEAR=-4000) 
TCORR MB NH 
SELECT 
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The output is: 


Data for the following results were selected according to
SELECT (YEAR=-4000)

Test for Correlation Coefficient = 0 of MB vs NH with 30 Cases
Ho: Correlation = 0 vs Alternative = 'not equal'

                   Standard
GROUP |     Mean  Deviation
------+--------------------
MB    |  131.367      5.129
NH    |   50.533      2.763

Sample Size                    : 30
Sample Correlation Coefficient : 0.511
95.00% Confidence Interval     : 0.185 to 0.736
t                              : 3.147
p-value                        : 0.004

[Quick Graph: Zero Correlation, scatter plot with MB on the horizontal axis]


The test result (p-value = 0.004) rejects the null hypothesis. There is a positive 
correlation (0.511) between the maximal breadth and the nasal height of skull in the 
year 4000 B.C.
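
The t statistic for zero correlation is r·sqrt(n - 2)/sqrt(1 - r²), and the reported interval is consistent with a Fisher z-transformation interval. The Python sketch below illustrates both under that assumption; it is not SYSTAT code and `zero_correlation_test` is a hypothetical helper. It starts from the rounded correlation 0.511, so the last digit of t can differ slightly from the output.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def zero_correlation_test(r, n, conf=0.95):
    """t statistic for Ho: rho = 0 and a Fisher z confidence interval for rho."""
    t = r * sqrt(n - 2) / sqrt(1.0 - r * r)
    half_width = NormalDist().inv_cdf(0.5 + conf / 2.0) / sqrt(n - 3)
    z_r = atanh(r)                      # Fisher transformation of r
    return t, (tanh(z_r - half_width), tanh(z_r + half_width))


# Sample correlation between MB and NH, 30 cases.
t, ci = zero_correlation_test(0.511, 30)
print(f"t = {t:.3f}, CI = {ci[0]:.3f} to {ci[1]:.3f}")
```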


Example 12
Test for Specific Correlation Coefficient 


maximum breadth and the nasal height of the skulls found in the year 4000 B.C. We 
will further investigate to see if this correlation coefficient equals 0.5.
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The input is: 


TESTING 
USE EGYPTDM 
SELECT (YEAR=-4000) 
TCORR MB NH = 0.5 
SELECT 


The output is: 


Data for the following results were selected according to
SELECT (YEAR=-4000)

Test for Correlation Coefficient between MB vs NH with 30 Cases
Ho: Correlation = 0.50 vs Alternative = 'not equal'

                        Standard
Population |     Mean  Deviation
-----------+--------------------
MB         |  131.367      5.129
NH         |   50.533      2.763

Sample Size                    : 30
Sample Correlation Coefficient : 0.511
95.00% Confidence Interval     : 0.185 to 0.736
z                              : 0.078
p-value                        : 0.938

[Quick Graph: Specific Correlation, scatter plot with MB on the horizontal axis]


From the above graph as well as the results of the test, it appears that the correlation 
between maximal breadth and nasal height of skull in the year 4000 B.C. does not 
differ significantly from the hypothesized value 0.5.
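
A test against a nonzero correlation is commonly based on the Fisher z-transformation: z = (atanh(r) - atanh(ρ₀))·sqrt(n - 3). The Python sketch below applies that formula to the values above; it is an illustration under that assumption rather than SYSTAT code, and `specific_correlation_test` is a hypothetical helper. Because it uses the rounded correlation 0.511, the result agrees with the output only to about two decimal places.

```python
from math import atanh, sqrt
from statistics import NormalDist


def specific_correlation_test(r, rho0, n):
    """Fisher z-test of Ho: rho = rho0 against a two-sided alternative."""
    z = (atanh(r) - atanh(rho0)) * sqrt(n - 3)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Sample correlation 0.511 from 30 cases, hypothesized value 0.5.
z, p_value = specific_correlation_test(0.511, 0.5, 30)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```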
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Example 13 
Test for Equality of Two Correlation Coefficients 


Does the correlation between the basibregmatic height and basialveolar length of the 


skull differ over the two different periods 200 B.C. and 150 A.D.? To investigate this, 
we use the EGYPTDM data.


The input is: 


TESTING 
USE EGYPTDM 


TCORR BH_200BC BL_200BC = BH_150AD BL_150AD

The output is:

                                Standard
Population |   N      Mean     Deviation
-----------+----------------------------
BH_200BC   |  30   132.300         5.134
BL_200BC   |  30    94.533         4.592
BH_150AD   |  30   130.333         4.971
BL_150AD   |  30    93.500         5.057

Correlation Coefficient of Set1             : 0.344
Correlation Coefficient of Set2             : 0.466
Difference between Correlation Coefficients : -0.122
z                                           : -0.539
p-value                                     : 0.590

[Quick Graphs: Equality of Two Correlations, one scatter plot per period (horizontal axes: BH_200BC and BH_150AD)]

The correlation coefficient between the basibregmatic height and basialveolar length 
of the skull found in the year 200 B.C. is 0.344 and it is 0.466 for the skulls found in 
the year 150 A.D. The p-value = 0.590 (> 0.05) indicates that there is no significant 
difference in the correlation coefficient between these two characteristics over the two 
time periods.


Example 14
Test for Single Proportion


According to the National Center for Education Statistics in Washington, D.C., 
approximately 16% of all elementary school teachers are men. A researcher randomly 
selected 1000 elementary school teachers in California from a statewide computer 
database and found that 142 were men. Does this sample provide sufficient evidence 
that the percentage of male elementary school teachers in California is different from 
the national percentage? The data are from Mendenhall et al. (2002). 


The input is: 


TESTING 
USE 


PROP 1000 142 = 0.16 / CONF=0.95 ALT=GT 


The output is: 


Single Proportion Test
Ho: Proportion = 0.16 vs Alternative = 'greater than'

Trials    : 1000.000
Successes : 142.000

Normal Approximation Test
Sample Proportion       : 0.142
95.00% Confidence Bound : 0.124
z                       : -1.590
p-value                 : 0.944

Large Sample Test
Sample Proportion       : 0.142
95.00% Confidence Bound : 0.124
z                       : -1.553
p-value                 : 0.940

The p-value indicates that there is no evidence that the percentage of male elementary 
school teachers in California is greater than the national percentage. Here, the two 
tests, one based on a normal approximation and the other a large-sample test, give 
approximately the same results.
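
The Large Sample Test figures can be reproduced with the usual z statistic for a proportion, using the hypothesized proportion in the standard error. The Python sketch below is an illustration, not SYSTAT code; `single_proportion_test` is a hypothetical helper, and computing the confidence bound from the sample proportion's standard error is an assumption that happens to match the printed bound.

```python
from math import sqrt
from statistics import NormalDist


def single_proportion_test(successes, trials, p0, conf=0.95):
    """Large-sample z-test of Ho: p = p0 against the alternative 'greater than'."""
    p_hat = successes / trials
    z = (p_hat - p0) / sqrt(p0 * (1.0 - p0) / trials)   # SE under the null
    p_greater = 1.0 - NormalDist().cdf(z)
    se_sample = sqrt(p_hat * (1.0 - p_hat) / trials)    # SE from the sample
    lower_bound = p_hat - NormalDist().inv_cdf(conf) * se_sample
    return z, p_greater, lower_bound


# 142 men among 1000 sampled teachers; national proportion 0.16.
z, p_value, lower_bound = single_proportion_test(142, 1000, 0.16)
print(f"z = {z:.3f}, p-value = {p_value:.3f}, lower bound = {lower_bound:.3f}")
```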


Example 15 
Test for Equality of Two Proportions 


In a comparison study of homeless and vulnerable meal-program users, Michael Sosin 
investigated determinants that account for a transition from having a home (domiciled) 
but utilizing meal programs to become homeless. The following information provides 
the study data (Mendenhall et al., 2002):

                            Homeless Men   Domiciled Men
Sample size                          112             260
Number currently working              34              98
Sample proportion                   0.30            0.38

The input is: 
TESTING
USE
PROP 112 34 = 260 98 / CONF=0.99 ALT=NE

The output is:

Equality of Proportions Test vs Alternative = 'not equal'

Population |   Trials  Successes  Proportion
-----------+--------------------------------
1          |  112.000     34.000       0.304
2          |  260.000     98.000       0.377

Normal Approximation Test
Difference between Sample Proportions : -0.073
z                                     : -1.372
p-value                               : 0.170

Large Sample Test
Difference between Sample Proportions : -0.073
99.00% Confidence Interval            : (illegible) to 0.063
z                                     : -1.356
p-value                               : 0.175
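
The Large Sample Test figures in this output can be cross-checked with the standard pooled two-proportion z-test. The Python sketch below is an illustration, not SYSTAT code; `two_proportion_test` is a hypothetical helper, and using the unpooled standard error for the confidence interval is an assumption that matches the legible upper limit.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(x1, n1, x2, n2, conf=0.99):
    """Pooled large-sample z-test of Ho: p1 = p2, with an unpooled interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    pooled = (x1 + x2) / (n1 + n2)
    z = diff / sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    se_unpooled = sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    crit = NormalDist().inv_cdf(0.5 + conf / 2.0)
    return diff, z, p_value, (diff - crit * se_unpooled, diff + crit * se_unpooled)


# Homeless men: 34 of 112 working; domiciled men: 98 of 260 working.
diff, z, p_value, ci = two_proportion_test(34, 112, 98, 260)
print(f"difference = {diff:.3f}, z = {z:.3f}, p-value = {p_value:.3f}")
print(f"99% CI = {ci[0]:.3f} to {ci[1]:.3f}")
```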

The p-values obtained by both of the tests are approximately the same, and neither 
leads to rejection of the null hypothesis.


Acronym & Abbreviation Expansions


A 

ABS - absolute value 

ACF - autocorrelation function 

ACOLOR - color axes 

ACS - arccosine 

ACT - actuarial life table 

AD test - Anderson Darling test 

ADDTREE - additive trees

ADFG - asymptotically distribution free estimate 
biased, Gramian

ADFU - asymptotically distribution free estimate 
unbiased

ADJSEASON - seasonal adjustment

AHMAX - maximum extent

AHMIN - minimum extent

AIC - Akaike information criterion

AID - automatic interaction detection

ALT - alternative

ANCOVA - analysis of covariance

ANG1 - deviation of angles from north in a 
clockwise direction

ANG2 - deviation of angles from horizontal (for 
3D models)

ANG3 - tilt angle

ANOVA - analysis of variance

ANOVAHYPO - hypothesis tests in analysis of 
variance

AR - autoregressive

ARIMA - autoregressive integrated moving 
average

ARL - average run length

CFUNC - coefficients for the classification 
functions 

CGM - Computer graphics metafile: binary or 
clear text 

CHAZ - cumulative hazard 

CHISQ - Chi-square distribution 

CHOL - Cholesky decomposition 

CI - confidence interval 

CIF - Cauchy inverse function 

CIM - confidence interval of mean 
CLASS - classification 

CLSTEM - stem and leaf plot for column 
CMeans - canonical scores of group means 
CMULTIVAR - multiple string variables 
COEF - coefficients 

COL/col - column 

COLPCT - Column percentages 
CONFIG - configuration

CONT - Contingency coefficient 

CONV - convergence 

CORAN - correspondence analysis 
CORR - correlations 

CORR1 - single correlation coefficient
CORR2 - equality of two correlations 
COV - covariance 

Cp - process capability index 

CPL - process capability based on lower 
specification limit 

CPU - process capability based on upper 
specification limit 

Cpk - process capability index for off-centered 
process

CR - confidence region 

CRA - cost of response above UTL 

CRB - cost of response below LTL 

CRN - Cauchy random number 
CSCORE - canonical scores 

CSIZE - size of characters 

CSQ - Chi-square 

CSTATISTICS - column statistics 

CSV - comma separated values 


CUSUM - cumulative sum 

CUSUM HI - Upper cumulative sum 
CUSUM LO - Lower cumulative sum 
CV - coefficient of variation 

CVI - cross validation index 


D 

DBF - Dbase files 
DC - deciles of risk 
DECF - Double exponential cumulative function 
DEDF - Double exponential density function 
DEIF - Double exponential inverse function 
DENFUN - density function 

dep. - dependent 

DERN - Double exponential random number 
DET - determinant 

DEVI - deviates (observed values - expected 
values) 

DEXP - Double exponential distribution 

df - degrees of freedom 

DF - distribution function 

DHAT - estimated distance 

DIF - data interchange format 

DIM - dimension 

DISCRIM - discriminant analysis 

DIST - distance 

DIT - dot histogram 

DOE - design of experiments 

DOS - disk operating system

DPMO - defects per million opportunities 
DPU - defects per unit 

DTA - Stata files 

DUCF - Discrete uniform cumulative function 
DUDF - Discrete uniform density function 
DUIF - Discrete uniform inverse function 
DUNIFORM - Discrete uniform 

DURN - Discrete uniform random number 
DWLS - distance weighted least-squares 


E 
ECF - Exponential cumulative function 


EDF - Exponential density function 

EEXP - extreme value exponential 

EIF - Exponential inverse function 

EIGEN - eigenvalues 

ELAMBDA - exp(lambda) 

EM - expectation-maximization 

EMF - Windows enhanced metafile

ENCF - Logit normal cumulative function 
ENDF - Logit normal density function 
ENIF - Logit normal inverse function 
ENORMAL - Logit normal 

ENRN - Logit normal random number 
EOF - end-of-file 

EOG - end-of-BY group 

EPS - Encapsulated postscript 

ERN - Exponential random number 

ES - exhaustive search 

ESS - error sum of squares 

EW - extreme value Weibull 

EWMA - exponentially weighted moving average 
EXP/exp - exponential/expected


F 

FAR - false-alarm rates 

FCF - F cumulative function 
FCOLOR - color foreground 

FDF - F density function 

FIF - F inverse function 

FINV - inverse of the F cumulative 
FITC - fitting distribution: continuous 
FITD - fitting distribution: discrete 
FITDIST - fitting distributions 
Flexibeta - flexible beta 

FPLOT - function plots 

FRN - F random number 

FTD - folded trellis detector 
FTDEV - Freeman-Tukey deviate 
FULLCOND - full conditional 
FUN - function 


G


H-L trace - Hotelling-Lawley trace

HR - hit-rates 

HRN - Hypergeometric random number 
HSD - honestly significant differences 
HTERM - terms tested hierarchically 
HTML - hyper text markup language 
HYMH - hybrid Metropolis-Hastings 


I 

IF - Inverse cumulative distribution function 
IGAUSSIAN - inverse Gaussian 

IGCF - Inverse Gaussian cumulative function 
IGDF - Inverse Gaussian density function 
IGIF - Inverse Gaussian inverse function 
IGRN - Inverse Gaussian random number 
IIDMC - independently and identically 
distributed Monte Carlo 

IMPSAMPI - importance sampling integration 
IMPSAMPR - importance sampling ratio 
I-MR - individual and moving range 
Ind/indep - independent 

IndMH - Independent Metropolis-Hastings 
INDSCAL - individual differences scaling
AMP - initial sample 

FUN - integrated function 

- iterated principal axis 

ITER - iterations


J 


JACK - jackknife

JCLASS - jackknifed classification

JMP - JMP v3.2 data files

JPEG/JPG - joint photographic experts group


K

K-M - Kaplan-Meier

KNBD - kth nearest neighborhood

KRON - Kronecker product

K-S test - Kolmogorov-Smirnov test

KS1 - one sample Kolmogorov-Smirnov tests
KS2 - two sample Kolmogorov-Smirnov tests


L 

LAD - least absolute deviations 

LB - larger the better 

LCF - Logistic cumulative function 
LCHAZ - log cumulative hazard 

LCL - lower control limit 

LCONV - log-likelihood convergence criteria 
LDF - Logistic density function 

LGM - log gamma 

LGST - logistic 

LIF - Logistic inverse function 

L-L/LL - log likelihood 

LMS - least median of squares
LMSREG - least median of squares regression
LNCF - Lognormal cumulative function 
LNDF - Lognormal density function 
LNIF - Lognormal inverse function 
LNOR/LNORMAL - lognormal 

LNRN - Lognormal random number 

loc - location 

LOG1 - one-parameter logistic (Rasch)
LOG2 - two-parameter logistic 

LOGIT - logistic regression 
LOGITHYPO - hypothesis tests in logistic 
regression 

LOGLIN - loglinear modeling 

LR - likelihood ratio 

LRCHI - likelihood ratio chi-square 
LRDEV - likelihood ratio of deviate 
LRN - Logistic random number 

LS - least-squares 

LSD - least significant difference 

LSL - lower specification limit

LSQ - least-squares

LTAB - life tables 

LTL - lower tolerance limit 

LW - Lawless and Wang 


M 
MA - moving average 


MAD - mean absolute deviation 

MAHAL - Mahalanobis distances 

MANCOVA - multivariate analysis of covariance 

MANOVA - multivariate analysis of variance 

MANOVAHYPO - hypothesis tests in MANOVA

MANOVAPOST - post hoc estimate for repeated
measures in MANOVA

MAR - missing at random 

MAX - maximum 

MAXSTEP - maximum number of steps 

MCAR - missing completely at random 

MCMC - Markov Chain Monte Carlo 

MDPREF - multidimensional preference 

MDS - multidimensional scaling 

MIN - minimum 

M-H - Metropolis-Hastings

MIS - number of missing values 

MIX - mixed regression 

MIXHIER - mixed regression for data having
hierarchical structure

MIXMULTY - mixed regression for data having
a multivariate structure

ML - Maximum Likelihood 

MLA - maximum likelihood analysis 

MLE - maximum likelihood estimate 

MML - maximum marginal likelihood 

MRC - Multiple Regression and Correlation 

MS - mean squares 

MSE - mean square error 

MSIGMA - sigma measurement 

MT - Mersenne-Twister

MTW - MINITAB v11 data files

MU2 - Guttman's mu2 monotonicity coefficients 

MULTIVAR - multiple variables
W - minimum within sum of squares deviations 

MWL - maximum Wishart likelihood 


N 


E - non-stationary first-order autoregressive 
NB - nominal the best




NZDIS - number of discretization points in the z 
(Depth) direction 


O
Obs - observed

OBSFREQ - observed frequency 

OC - operating characteristic 

ODBC - open database connectivity
OFREQ - outlier frequencies 

OLS - ordinary least-squares 
ORTHEQ - Equally spaced orthogonal
component

ORTHUN - Unequally spaced orthogonal
component


P 

P - Proportion nonconforming 

PACF - Pareto cumulative function 
PACF - partial autocorrelation function
PADF - Pareto density function 

PAIF - Pareto inverse function 
PARAM - parameters 

PARN - Pareto random number 

PCA - process capability analysis 
PCF - iterated principal axis factoring 
PCF - Poisson cumulative function 
PCNTCHANGE - percentage change 
PCT - Macintosh PICT 

PDF - Poisson density function 

pdf - probability density function 
PDL - polynomial distributed lag 
PERMAP - perceptual mapping 

PIF - Poisson inverse function 
PLIMITS - probability limits 

PLS - partial least squares

pmf - probability mass function 
PMIN - minimum proportion 

PNG - Portable Network Graphics 
POLY - polygon 

POSAC - partially ordered scalogram analysis 
with coordinates 


P-P - probability plot 

PP - process performance 

Ppk - Process performance index for off-centered 
process 

PPL - process performance based on lower 
specification limit 

PPM - parts per million 

PPU - process performance based on upper 
specification limit 

PRE - percentage reduction error 
PREFMAP - preference mapping 

PRN - Poisson random number 

PROB - probability 

PROP1 - single proportion

PROP2 - equality of two proportions 

PS - PostScript 

PVAF/p.v.a.f. - present value annuity factor
P-value - probability value 


Q 

QC - quality control 

QMLE - quasi maximum likelihood estimate 
QNTL - quantiles 

QPLOT - quantile plots 

Q-QPLOT - two sample quantile plot 
QRD - QR decomposition 

QS - quick search 

QSK - quantitative symmetric similarity
coefficients (or Kulczynski measure)
QUASI - Quasi-Newton method 


R 

R&R - repeatability and reproducibility

R chart - range chart 

RADMAX - maximum horizontal direction for 
the search radius 

RADMIN - minimum horizontal direction for the 
search radius 

RAND - random 

RANDSAMP - random sampling 

RANKREG - rank regression
RBSTAT - row basic statistics

RCF - Rayleigh cumulative function 

RDF - Rayleigh density function 
RDISCRIM - robust discriminant 

RDIST - robust distance 

RDVER - vertical direction for the search radius 
REPAR - reparametrize 

REPS - replicates 

RESID - residuals 

RIF - Rayleigh inverse function 

RJS - rejection sampling 

RMS - root mean square 

RMSEA - root mean square error of 
approximation 

RMSSTD - root mean square standard deviation 
ROC - receiver operating characteristic 
ROWPCT - Row percentages 

RRN - Rayleigh random number 

RS - response surface 

RSE - robust standard errors

RSEED - random seed 

RSM - response surface methods

RSQ - stress and squared correlation 

RSS - residual sum of squares 
RSTATISTICS - row statistics 

RTF - rich text format
RWM-H - random walk Metropolis-Hastings 
RWSTEM - stem and leaf plot for rows 


S 

S chart - standard deviation control chart
SANG1 - angle (in degrees) of the first minor axis
of the search ellipsoid
SANG2 - angle (in degrees) of the major axis of
the search ellipsoid
SANG3 - angle (in degrees) of the second minor
axis of the search ellipsoid

SAV - SPSS files 

SB - smaller the better 

SC - scale 

SC - set correlation 




SCDFUNC - standardized coefficients for 
canonical variables 
SCF - Studentized cumulative function 
SD - standard deviations 
sd2/sas7bdat - SAS v9 files 
SDF - Studentized density function 
SE/se/S.E. - standard error 
SEK - standard error of kurtosis 
SEM - standard error of mean 
SES - standard error of skewness 

- sh 
2 - pst inverse function i 
SIMPLS - Straight-forward Implementation of
Partial Least Squares
SKMEAN - simple kriging mean 
SL - specification limit 
SMIN - minimum split value 
SPLOM - scatter plot matrix 
SQL - structured query language 
SQRT/SQR - square-root 
SRN - Studentized random number 
SRWR - sum of rank weighted residuals 

SSCP - sums of squares and cross products
STA - Statistica v5 data files 
STAND - standardized deviates
SVD - singular value decomposition 

- Shapiro-Wilks | 

e CH - SYSTAT command Files 
SYZ/SYD/SYS - SYSTAT data files 
SYO - SYSTAT output files


TANALYZE - Taguchi design: analyze 
TCF - t cumulative function

TCOR - total correlation 

TCOV - total covariance 

TDF - t density function
TESTAT - Test Item Analysis


TESTATCL - classical test item analysis 


TESTATLOG - logistic item response analysis 


TETRA - tetrachoric correlations 
TGENERATE - Taguchi design: generate 
TIF - t inverse function 

TIFF - Tagged Image File Format 

TLOG - log time 

TLOSS - Taguchi's Loss Function 

TNH - hyperbolic tangent 


TOHC0 - Hypothesis Testing: Zero correlation
TOHC1 - Hypothesis Testing: Specific correlation
TOHC2 - Hypothesis Testing: Equality of two
correlation coefficients
TOHP1 - Hypothesis Testing: Single proportion
TOHP2 - Hypothesis Testing: Equality of two
proportions
TOHT1 - Hypothesis Testing: One sample t-test
TOHT2 - Hypothesis Testing: Two sample t-test
TOHTPAIRED - Hypothesis Testing: Paired t-test
TOHV1 - Hypothesis Testing: Single variance
TOHV2 - Hypothesis Testing: Two variances
TOHVN - Hypothesis Testing: Several variances
TOHZ1 - Hypothesis Testing: One sample z-test
TOHZ2 - Hypothesis Testing: Two sample z-test


TOL - tolerance 

TPLOT - time series plot 
TPREDICT - Taguchi design: predict 
TRCF - Triangular cumulative function 
TRDF - Triangular density function 
TRI - triangular 

TRIF - Triangular inverse function 
TRIM - trimmed mean 

TRN - t random number 

TRP - transpose 

TRRN - Triangular random number 


TSFOURIER - Fourier decomposition of time
series
TSIV - Two-Stage Instrumental Variables 
TSLS - Two-Stage Least Squares 


TSP - traveling salesman path 

TSQ chart - Hotelling's T² chart
TSSMOOTH - smoothing time series 
TXT - text format 


U 

U chart - chart showing defects per unit 
UCF - Uniform cumulative function 
UCL - upper control limit 

UDF - Uniform density function 

UIF - Uniform inverse function 

UNCE - uncertainty coefficient 

URN - Uniform random number 

USL - upper specification limit 

UTL - upper tolerance limit 


V 
VAR - variance 
VIF - variance inflation factor 


W 

WB - Weibull 

WCF - Weibull cumulative function 
WCOR - pooled within-group correlation 
WCOV - pooled within-group covariance 
WDF - Weibull density function 
WHISKER - Box-and-Whisker plot 

WIF - Weibull inverse function 

WMF - Windows metafile 

WRN - Weibull random number 


X 


XCF - Chi-square cumulative function
XDF - Chi-square density function 

XIF - Chi-square inverse function 

XLAG - separation distance between lags 
XLS - Excel format

XLTOL - tolerance for lags 

XMAX - maximum along x axis 

XMIN - minimum along x axis
X-MR chart - Individuals and moving range chart
XPT/TPT - SAS transport files 

XRN - Chi-square random number 

XTAB - Crosstabulations 


Y 
YMAX - maximum along y axis 
YMIN - minimum along y axis 


Z 

Z1 - one-sample z-test

Z2 - two-sample z-test 

ZCF - Normal cumulative function 
ZDF - Normal density function 
ZICF - Zipf cumulative function 
ZIDF - Zipf density function 
ZIF - Normal inverse function 
ZIIF - Zipf inverse function
ZIRN - Zipf random number 
ZMAX - maximum along z axis
ZMIN - minimum along z axis
ZRN - Normal random number 




A


A matrix, I-192
accelerated failure time distribution, IV-433
ACF plots, IV-529
additive trees, I-80, I-91
AIC and Schwarz's BIC, II-39, II-108, II-292, II-300,
II-344, II-385, III-1, III-258, IV-99, IV-427
see linear models, II-17
Akaike Information Criterion, III-458
alpha level, IV-22, IV-28
alternative hypothesis, I-13, IV-20
analysis of covariance, II-153, II-209
examples, II-170
analysis of variance, II-107
AIC and Schwarz’s BIC, II-108
algorithms, II-171
assumptions, II-25
between-group differences, II-32
commands, II-121
compared to loglinear modeling, III-95
compared to regression trees, I-45
contrasts, II-28, II-113, II-115, II-116
data format, II-121
examples, II-122, II-126, II-132, II-145, II-146,
II-148, II-151, II-155, II-160,
II-163, II-166, II-170
factorial, II-24
homogeneity tests, II-113
hypothesis tests, II-23, II-113, II-115, II-116
interactions, II-25
normality tests, II-112
pairwise comparisons, II-117
power analysis, IV-19, IV-26, IV-55, IV-57,
IV-77, IV-80
Quick Graphs, II-121


Index 


repeated measures, II-31, II-110
resampling, II-108
residuals, II-110
sums of squares, II-113
two-way ANOVA, IV-26, IV-57, IV-80
unbalanced designs, II-29
unequal variances, II-26
usage, II-121
within-subject differences, II-32
Anderberg dichotomy coefficients, I-164, I-173
Anderberg’s binary similarity coefficient, I-164
Anderson-Darling test, I-303
Andrews procedure, III-279
angle tolerance, IV-388
anisotropy, IV-392, IV-405
geometric, IV-392
zonal, IV-393
A-optimality, I-364
ARIMA models, IV-514, IV-523, IV-540
algorithms, IV-578
arithmetic mean, I-299, I-308
ARMA models, IV-519
asymptotically distribution-free estimates, III-412
autocorrelation plots, I-11, IV-516, IV-520
Automatic Interaction Detection (AID), I-45, I-47
autoregressive models, IV-516
average run length curves, IV-134
chart types, IV-137
continuous distributions, IV-139
discrete distributions, IV-140
overview, IV-134
probability limits, IV-137
axial designs, I-360




backward elimination, II-15
bandwidth, IV-350, IV-355, IV-388
optimal values, IV-356
relationship with kernel function, IV-357
basic statistics
Anderson-Darling test, I-303, I-309
columns, I-307
commands, I-322
Cronbach's alpha, I-321
examples, I-324, I-326, I-327, I-328, I-333, I-338,
I-340, I-341, I-342
geometric mean, I-300, I-308
harmonic mean, I-300, I-308
multivariate normality assessment, I-303
N-&P-tiles, I-309
overview, I-297
Quick Graphs, I-323
resampling, I-298
rows, I-316
Shapiro-Wilk test, I-302, I-309
stem-and-leaf for columns, I-314
stem-and-leaf for rows, I-320
test for normality, I-302
trimmed mean, I-299, I-308
usage, I-323
Bayesian regression, II-50
credibility intervals, II-50
gamma prior, II-52
normal prior, II-52
best linear unbiased estimates (BLUE), II-344, II-386
best linear unbiased predictors (BLUP), II-344, II-386
beta level, IV-22
between-group differences
in analysis of variance, II-32
bias, II-15
binary logit, III-2
compared to multinomial logit, III-5
binary trees, I-43
biplots, IV-6, IV-8
bisquare procedure, III-279
biweight kernel, IV-365
Bonferroni inequality, I-47
Bonferroni test, I-175, II-27, II-118, II-196, II-307,
II-394
bootstrap, I-19, I-21
box plot, I-305
Box-and-Whisker plots, IV-112
Box-Behnken designs, I-357, I-380
Box-Cox power transformation, IV-157
Box-Hunter designs, I-353, I-373
Bray-Curtis measure, I-162, I-172
broad inference space, II-280


C 


c charts, IV-131
C matrix, II-193
candidate sets
for optimal designs, I-363
canonical correlation analysis
data format, IV-304
examples, IV-305, IV-308, IV-312
interactions, IV-304
model, IV-299
nominal scales, IV-304
overview, IV-291
partialled variables, IV-300
Quick Graphs, IV-305
resampling, IV-291
rotation, IV-303
usage, IV-304
canonical rotation, IV-7
categorical data, III-321
categorical predictors, I-45
Cauchy kernel, IV-365
CCF plots, IV-531
central composite designs, I-356, I-384
centroid designs, I-359
CHAID, I-46, I-47
chi-square tests for independence, I-229, I-233, I-242
circle model 




in perceptual mapping, ГУ-5 saving files, 1-95 
city-block distance, I-172, Ш-191 usage, 1-95 
classical analysis, ГУ-488 clustered data, 11-421 
classification and regression trees, I-41 clustering 
classification functions, 1-396 hierarchical clustering, 1-68 
classification trees k-clustering, I-78 
algorithms, 1-62 validity, 1-87 
basic tree model, 1-42 Cochran's test of linear trend, 1-234 
commands, 1-54 coefficient of alienation, Ш-190, Ш-212 
compared to discriminant analysis, 1-46, 1-49 coefficient of determination 
‚ 1-46 see multiple correlation 


coefficient of variation, 1-307 
Cohen's kappa, I-226, 1-234 
communalities, 1-458 
compound symmetry, 1-32 
conditional logistic regression, Ш-5 
confidence curves, Ш-273 
confidence intervals, I-11, 1-307 
path analysis, Ш-455 
conjoint analysis 
additive tables, I-126 
algorithms, I-152 
commands, I-135 
compared to logistic regression, I-132 
data format, 1-135 


data format, 1-54 
displays, 1-51 

examples, 1-55, I-57, 1-59 
loss functions, 1-51 
missing data, 1-62 
mobiles, 1-41 

model, І-51 

overview, 1-41 

pruning, 1-47 

Quick Graphs, 1-54 
resampling, 1-41 

saving files, 1-54 

stopping criteria, 1-47, 1-53 


2 5 1-54 examples, І-136, 1-140, 1-143, 1-147 
cluster analysis ee 
additive trees, 1-91 missing data, 1-153 
ra model, 1-133 


algorithms, 1-122 multiplicative tables, I-128 
clustering, 1-65 i 
d ЖҮ” overview, 1-125 
А е Ф ~ : Quick Graphs, 1-135 
ata types, 1- resampling, 1-125 


distances, 1-84 у е 
examples, 1-96, 1-105, 1-108, 1-109, 1-111, I- saving files, 1-135 


112, 1115, 1-116, I-118, 1-120 Рас 1:135 

exclusive clusters, 1-66 oon in mixture designs, 1-360 
nen aie i wang eae, La 

4 ку Ag t, IV-243 
k-medians clustering, 1-79 е = 1V-401 

. . , 
аара 66 contrast coefficients, 1-31 

: x contrasts 

— a in analysis of variance, П-28 
Quick Graphs, I- control charts 


resampling, 1-66 




aggregated data, IV-120
average run length curves, IV-136
control limits, IV-121
discrete control limits, IV-121
operating characteristic curves, IV-135
raw data, IV-120
regression charts, IV-152
sigma limits, IV-122
convergence, III-98
convex hulls, IV-398
Cook's distance, II-12
Cook-Weisberg graphical confidence curves, III-273
coordinate exchange method, I-363, I-386
correlations, I-67, I-157
algorithms, I-199
binary data, I-173
canonical, IV-291
commands, I-177
continuous data, I-171
data format, I-178
dissimilarity measures, I-172
distance measures, I-172
examples, I-179, I-182, I-185, I-186, I-188, I-192,
I-195, I-196, I-198
missing values, I-170, I-199, III-135
options, I-174
overview, I-157
power analysis, IV-19, IV-25, IV-42, IV-44
Quick Graphs, I-178
rank-order data, I-172
resampling, I-158
saving files, I-179
set, IV-291
usage, I-178
correlograms, IV-403
correspondence analysis, IV-2, IV-6
algorithms, I-218
commands, I-206
data format, I-206
examples, I-207, I-214
missing data, I-218
model, I-204


overview, 1-201 
Quick Graphs, 1-206 
resampling, 1-201 
simple correspondence analysis, I-204 
usage, 1-206 
covariance matrix, 1-171, Ш-135 
covariance paths 
path analysis, Ш-401 
covariograms, IV-387 
Cox-Snell residual plot, IV-434 
Cramer's V, 1-227 
critical level, 1-13 
Cronbach's alpha, IV-488, IV-489 
see basic statistics, I-321 
crossover designs, II-175 
crosstabulation 
commands, 1-244 
data format, 1-246 


examples, 1-248, 1-250, 1-253, 1-256, 1-257, I- 
258, 1-261, 1-263, 1-269, 1-271, І- 


273, 1-275, 1-277, 1-279, 1-293 

multiway, I-237 

one-way, 1-220, 1-222, 1-228 

overview, 1-219 

Quick Graphs, 1-247 

resampling, 1-219 

standardizing tables, I-221 

two-way, 1-220, 1-223, І-231 

usage, 1-246 
cross-validation, 1-48, 1-396, П-16, Ш-360 
cumulative sum charts 

see cusum charts, IV-142 


D 


D matrix, 11-194, П-288, 11-309, 11-355, 11-397 
D SUB-A (d,), IV-321 
dates, IV-430 
dendrograms, 1-65, I-107 
dependence paths 
path analysis, Ш-399 
descriptive statistics, I-1
see basic statistics, I-307


design of experiments, 1-132, 1-368, 1-369 
axial designs, 1-360 
Box-Behnken designs, 1-357 
central composite designs, 1-356 
centroid designs, 1-359 
commands, 1-370 
examples, 1-371, 1-372, 1-373, 1-375, 1-377, 1- 
379, 1-380, 1-381, 1-382, 1-384, 1-386 
factorial designs, 1-349, 1-350 
lattice designs, 1-359 
mixture designs, 1-350, 1-357 
optimal designs, 1-350, 1-362 
overview, 1-345 
Quick Graphs, 1-371 
response surface designs, 1-350, 1-354 
screening designs, 1-360 
usage, 1-370 
determinant criterion 
see D-optimality 
Dice's binary similarity coefficient, 1-164 
dichotomy coefficients, 1-164 
Anderberg, 1-173 
Jaccard, 1-173 
positive matching, 1-173 
simple matching, 1-173 
Tanimoto, 1-173 
difficulty, IV-507 
discrete choice model, Ш-7 
compared to polytomous logit, Ш-8 
discrete gaussian convolution, IV-361 
discriminant analysis 
classical discriminant analysis, 1-400 
commands, I-407 
data format, 1-408 
estimation, 1-401 
examples, 1-409, 1-413, 1-420, 1-427, 1-435, І- 
438, 1-444, 1-449 
linear discriminant function, 1-397 
model, 1-400 
multiple groups, 1-399 
options, 1-401 
overview, 1-391 
prior probabilities, 1-398 




Quick Graphs, 1-408 
resampling, 1-391 
robust discriminant analysis, I-399 
statistics, 1-404 
stepwise estimation, 1-401 
usage, 1-408 
discrimination parameter, IV-507 
dissimilarities 
direct, Ш-187 
indirect, Ш-187 
distance measures, 1-67, I-157 
distances 
nearest-neighbor, IV-396 
distance-weighted least squares (DWLS) smoother, 
IV-361 
distributions 
Benford's law, 1-499, Ш-332, IV-86, IV-221 
beta, 1-500, Ш-333, 111-335, IV-88, IV-222 
binomial, 1-499, Ш-332, IV-86, IV-221 
Cauchy, 1-500, Ш-333, 111-335, IV-88, IV-222 
chi-square, 1-500, Ш-333, 11-335, IV-88, IV- 
222 
discrete uniform, 1-499, Ш-332, IV-86, IV- 
221 
double exponential, 1-501, Ш-335, IV-88, IV- 
222 
Erlang, 1-501, Ш-335, IV-88, IV-222 
exponential, 1-501, 11-333, Ш-336, IV-88, 
IV-222 
F, Ш-333, Ш-336, IV-88, ГУ-222 
gamma, 1-501, Ш-333, Ш-336, ІУ-89, IV-222 
generalized lambda, IV-222 
geometric, 1-499, Ш-332, IV-86, IV-221 
Gompertz, 1-501, Ш-333, Ш-336, IV-89, IV- 
222 
Gumbel, 1-501, 111-333, 111-336, IV-89, IV- 
222 
hypergeometric, 1-499, Ш-332, IV-86, IV-221 
inverse Gaussian, 1-501, 111-333, 111-336, IV- 
89, IV-222 
logarithmic series, 1-499, III-332, IV-87, IV- 
221 
logistic, I-501, 111-333, Ш-336, IV-89, IV-222 




logit normal, 1-501, III-333, Ш-336, IV-89, Е 


IV-222 
loglogistic, I-501, Ш-333, Ш-336, IV-89, ГУ- ЕСУІ, Ш-458 

222 edge effects, ГУ-398 
lognormal, 1-501, Ш-333, Ш-336, IV-89, IV- effect size 

222 in power analysis, IV-22, IV-23 
negative binomial, 1-499, Ш-333, IV-87, IV- effects coding, II-20, П-180 

221 efficiency, 1-362 
non-central chi-square, Ш-333, Ш-336, IV-89, eigenvalues, 1-405 

IV-222 ellipse model 
non-central Е, Ш-333, Ш-336, IV-89, IV-222 in perceptual mapping, IV-6 


non-central t, Ш-333, Ш-336, IV-89, ТУ-222 ЕМ algorithm, 1-492 
normal, 1-501, Ш-333, 111-336, IV-89, IV-222 ЕМ estimation, Ш-130 


Pareto, 1-501, 11-333, Ш-336, IV-89, IV-222 for correlations, 1-175, III-135 
Poisson, 1-499, Ш-333, IV-87, IV-221 for covariance, III-135 
Rayleigh, 1-501, Ш-333, Ш-336, IV-89, IV- for SSCP matrix, III-135 
223 endogenous variables 
smallest extreme value, 1-501, III-333, Ш-336, path analysis, Ш-400 
IV-89, IV-223 Epanechnikov kernel, IV-364 
studentized maximum modulus, Ш-333, Ш- — equamax rotation, 1-460, 1-464 
336, IV-89 Erlang, Ш-333 
Studentized range, III-336 Estimation, Ш-135 
studentized гапре, Ш-333, IV-89, IV-223 Euclidean distances, Ш-188 
t, Ш-333, Ш-336, IV-89, IV-223 exogenous variables 
triangular, 1-501, Ш-334, Ш-336, IV-89, IV- path analysis, Ш-400 
223 expected cross-validation index, III-458 
uniform, 1-501, Ш-334, Ш-336, IV-89, IV- Exponential, Ш-336 
223 exponential distribution, IV-432 
Weibull, 1-501, Ш-334, 11-336, IV-89, IV- exponential model, IV-390, IV-404 
223 exponential smoothing, IV-524 
тірі, 1-499, Ш-333, IV-87, IV-221 exponentially weighted moving average charts, IV- 
dit plots, I-14 146 
D-optimality, 1-364 control limits, IV-147 
dot histogram plots, I-14 external unfolding, IV-4 
Double, III-333 
D-Prime (d' ), IV-320 F 
dummy codes, II-180 F, Ш-333 
Duncan test, П-27, II-119, 11-197 F and К matrices, 11-308, 11-354, 11.396 
Dunnett test, П-27, П-119, П-197 F distribution , 
Dunnett's T3 test, 1-27, П-119, П-197 Е matrix, 11-287 
Dunn-Sidak test, I-175 : 


factor analysis, 1-457, IV-2 
algorithms, 1-492 
commands, 1-468 


compared to principal components analysis, 1- 
460 
convergence, 1-463 
correlations vs covariances, 1-457 
eigenvalues, 1-463 
eigenvectors, 1-467 
examples, 1-469, 1-473, 1-476, 1-478, 1-482, I- 
485 
iterated principal axis, 1-463 
loadings, 1-467 
maximum likelihood, 1-463 
missing values, 1-492 
number of factors, 1-463 
overview, 1-453 
principal components, 1-463 
Quick Graphs, 1-468 
resampling, 1-453 
residuals, 1-465 
rotation, 1-459, 1-464 
save, 1-466 
scores, 1-466 
usage, 1-468 
factor loadings, IV-488 
factorial analysis of variance, П-24 
factorial designs, 1-349, 1-350 
analysis of, 1-353 
examples, 1-371 
fractional factorials, 1-352 
full factorial designs, 1-352 
F-distribution 
non-centrality parameter, IV-60 
Fedorov method, 1-363 
Fieller bounds, 1-48 
filters, IV-527 
Fisher's exact test, 1-226, 1-233 
Fisher's linear discriminant function, IV-2 
Fisher’s LSD, П-197 
Fisher's LSD test, 11-27, 11-118, 11-307, 11-395 
fitting distributions 
commands, 1-501 
examples, 1-504, 1-505, 1-507, 1-508, 1-510, I- 
511, 1-513 
goodness-of-fit tests, 1-496 




maximum likelihood method, 1-497 

method of moments, 1-497 

method of quantiles or order statistic, 1-497 

overview, 1-495 

Quick Graphs, 1-503 

Shapiro-Wilk's test for normality, 1-497 

usage, 1-503 
fixed effects, 11-279 
fixed variance 

path analysis, Ш-402 
fixed-bandwidth method 

compared to KNN method, IV-357 

for smoothing, ГУ-355, 1У-357, ІУ-364 
Fletcher-Powell minimization, IV-507 
forward selection, П-15 
Fourier analysis, IV-526, IV-545 
fractional factorial designs 

Box-Hunter designs, 1-353 

examples, 1-372, 1-373, 1-375, 1377, 1-379 

homogeneous fractional designs, 1-353 

Latin square designs, 1-353 

mixed-level fractional designs, 1-353 

Plackett-Burman designs, 1-353 

Taguchi designs, 1-353 
Freeman-Tukey deviates, 1-93, Ш-102 
frequencies, 1-23, 1-54, 1-135, 1-179, 1-206, 1-246, 
1-248, 1-323, 1-408, 1-468, 1-469, 1-503, 1-544, П- 
54, П-121, 1-122, 11-202, 11-310, 1-357, П-399, 
11-441, Ш-23, Ш-103, 111-104, Ш-137, Ш-194, Ш- 
217, Ш-283, Ш-339, Ш-364, Ш-385, Ш-413, ІУ- 
9, IV-62, IV-63, 1V-103, IV-162, IV-244, IV-280, 
IV-305, IV-325, 1V-328, IV-366, IV-410, IV-449, 
1V-495, IV-498, IV-547, IV-587 
frequency tables, Ш-93, Ш-102 

see crosstabulation 
Friedman test, Ш-328 


G 


Gabriel test, II-27, II-119, II-197
Games-Howell test, II-27, II-119, II-197
Gaussian kernel, IV-364, IV-365
Gaussian model, IV-390, IV-404




Gauss-Newton method, Ш-269, Ш-272 
general linear models, П-175 
algorithms, 11-249 
categorical variables, П-179 
commands, 11-200 
contrasts, П-189, 11-191 
data format, 11-201 
examples, II-203, П-211, II-212, П-213, I- 
215, П-217, П-220, П-222, П-224, 
П-234, П-237, П-238, П-242, П-246, 
П-247, 11-248 
hypothesis options, П-188 
hypothesis tests, П-186 
mixture model, П-184 
model estimation, П-177 
overview, П-175 
pairwise comparisons, П-195 
post hoc tests, П-199 
Quick Graphs, П-202 
resampling, II-176 
stepwise regression, II-183 
usage, П-201 
generalized least squares, Ш-412, IV-584 
generalized variance, ГУ-294 
geometric mean, 1-300, 1-308 
geostatistical models, IV-386, IV-387 
between-groups testing, III-239
Gini index, 1-48, 1-51 
GLM 
see general linear models, П-175 
global criterion 
see G-optimality 
GMA chart, IV-146 
Goodman-Kruskal gamma, 1-227, 1-234 
Goodman-Kruskal lambda, 1-234 
goodness-of-fit tests, I-496 
G-optimality, 1-364 
Gower2 binary similarity coefficient, I-164 
Graeco-Latin square designs, 1-353 
Greenhouse-Geisser statistic, II-33
Guttman mu2 monotonicity coefficients, I-162 
Guttman’s coefficient of alienation, Ш-190 
Guttman’s loss function, Ш-212 


Guttman-Rulon coefficient, ГУ-489 
H 


Hadi outlier detection, I-168 
Hamman's binary similarity coefficient, 1-164 
Hampel procedure, III-279 
Hanning weights, IV-512 
harmonic mean, 1-300, 1-308 
hazard function 
heterogeneity, IV-435 
Henderson’s mixed model equations, 11-279, 11-293 
Henze-Zirkler test, I-303 
heteroskedasticity, IV-583 
heteroskedasticity-consistent standard errors, IV- 
583 
hierarchical clustering, 1-68, I-82 
distances, 1-84 
validity index, 1-75 
hierarchical linear mixed models 
categorical variables, П-389 
commands, П-398 
examples, 11-399, П-402, П-406, 11-408, II- 
412, П-414, 11-417 
hypothesis testing, П-394 
model estimation, 11-387 
options, П-392 
overview, II-385 
Quick Graphs, П-398 
random effects, II-390 
usage, II-398 
hierarchical linear models 
see mixed regression 
hinge, 1-301 
Hochberg's GT2 test, П-27, П-119, П-197, 11-307, 
П-395 
hole model, IV-391, IV-405 
Holt's method, IV-524 
homogeneity tests, П-113 
Levene's test, II-113
two-stage least squares, IV-581
regression charts, IV-152 
regression trees, 1-45 
algorithms, I-62 
basic tree model, 1-42 
commands, I-54 
compared to analysis of variance, 1-45 
compared to stepwise regression, I-46 
data format, I-54 
displays, 1-51 
examples, 1-55, 1-57, 1-59 
loss functions, 1-48, I-51 
missing data, I-62 
mobiles, 1-41 
model, I-51 
overview, І-41 
pruning, 1-47 
Quick Graphs, I-54 
resampling, 1-41 
saving files, 1-54 
stopping criteria, 1-47, 1-53 
usage, I-54 
R-E-G-W Q test, П-197 
R-E-G-W-Q test, П-27, П-119 
reliabilities, ГУ-492 
reliability, IV-489 
repeated measures, П-31 
assumptions, П-32 
resampling 
algorithms, 1-38 
bootstrap-t method, I-19 


command, I-22 
examples, 1-23, 1-27, 1-28, 1-33, 1-34, 1-36 
missing data, 1-38 
naive bootstrap, I-19 
overview, I-17 
Quick Graphs, I-22 
usage, I-22 
response optimization, IV-234 
canonical analysis, IV-234 
desirability analysis, IV-236 
ridge analysis, IV-235 
response surface designs, I-350, I-354
analysis of, I-357
Box-Behnken designs, I-357
central composite designs, I-356
examples, I-380, I-384
rotatability, I-355, I-356
response surface methods, IV-231 
commands, IV-244 
contour and surface plot, IV-233, IV-243 
customization, IV-238 
estimate model, IV-237, IV-238 
examples, IV-245, IV-247, IV-249, IV-250 
lack of fit, IV-233 
optimize, IV-240 
overview, IV-231
Quick Graphs, IV-244 
usage, IV-244 
response surfaces, I-132, III-273
restricted/residual maximum likelihood estimates,
II-385
ridge regression, II-48
right censored data, IV-428
RMSEA, III-457
robust discriminant analysis, I-399
robust regression
commands, IV-279
examples, IV-280, IV-283, IV-284 
LAD regression, IV-260 
LMS regression, IV-261 
LTS regression, IV-261
M-regression, IV-261
overview, IV-255
Quick Graphs, IV-279
rank regression, IV-262
S regression, IV-262
usage, IV-279
robust smoothing, IV-358, IV-365 
robustness, III-321
ROC curves, IV-320
root mean square error of approximation, III-457
rotatability
in response surface designs, I-355
rotatable designs
in response surface designs, I-356
rotation, I-459
Roy’s Greatest root, III-226
running median smoothers, IV-512
running-means smoother, IV-360
s charts, IV-126
plotting with X-bar charts, IV-129 
Sakitt D, IV-321 
sample size, IV-23, IV-30 
samples, I-8 
saturated models 
loglinear modeling, III-95
scale regression, IV-262 
scalogram 
see partially ordered scalogram analysis with 
coordinates 
scatterplot matrix, I-160
Scheffé model
in mixture designs, I-361
Scheffé test, II-27, II-118, II-197, II-307, II-395
screening designs, I-360
SD-RATIO, IV-321 
seasonal decomposition, IV-523 
second-order stationarity, IV-387 
semi-variograms, IV-388 
set correlations 
assumptions, IV-292 
categorical variables, IV-301 
data format, IV-304 


measures of association, IV-293 

missing data, IV-316 

overview, IV-291 

partialing, IV-292 

usage, IV-304 
Shapiro-Wilk test, I-302
Shepard diagrams, III-189, III-194
Shepard's smoother, IV-360
Shewhart control charts 

c charts, IV-131 

np charts, IV-129 

p charts, IV-130 

R charts, IV-128 

s charts, IV-126 

u charts, IV-133 

variance charts, IV-124 

X charts, IV-129 

X-bar charts, IV-123
Sidak test, II-27, II-118, II-197, II-307, II-395
sign test, III-325, III-326

signal detection analysis 
algorithms, IV-346 
chi-square model, IV-323 
commands, IV-324 
convergence, IV-324 
data format, IV-325 


examples, IV-328, IV-333, IV-335, IV-336,
IV-340, IV-342, IV-344
exponential model, IV-323 
gamma model, IV-323 
logistic model, IV-323 
missing data, IV-346 
nonparametric model, IV-323 
normal model, IV-323 
overview, IV-319 
poisson model, IV-323 
Quick Graphs, IV-327 
ROC curves, IV-327 
usage, IV-325 
sill, IV-392 
similarity measures, I-157
simple matching dichotomy coefficients, I-164, I-173
simplex, I-359
Simplex method, III-269, III-273
SIMPLS (Straight-forward IMplementation of Partial
Least Squares), III-377
see partial least squares regression
simulation, IV-394
singular value decomposition, I-201, IV-6, IV-16
Skewness, I-307
positive, I-4 
slope, II-13 
smoothing, IV-362, IV-510 
bandwidth, IV-350, IV-355 
biweight kernel, IV-362, IV-364, IV-365 
Cauchy kernel, IV-362, IV-365 
commands, IV-366 
confidence intervals, IV-368 
data format, IV-366 
discontinuities, IV-360 
discrete gaussian convolution, IV-361 
distance-weighted least squares (DWLS), IV-361
Epanechnikov kernel, IV-362, IV-364 
examples, IV-367, IV-368, IV-370, IV-380 
fixed-bandwidth method, IV-355, IV-362, IV-364
Gaussian kernel, IV-362, IV-364, IV-365 
grid points, IV-361, IV-362, IV-382 
inverse-distance, IV-360 
k nearest-neighbors method, IV-356 
kernel functions, IV-350, IV-352, IV-362, IV-364
LOESS smoothing, IV-361, IV-362, IV-367, 
IV-368, IV-370, IV-380 
Marron & Nolan canonical kernel width, IV-357,
IV-362, IV-364
mean smoothing, IV-358, IV-365 
median smoothing, IV-358 
methods, IV-350, IV-358, IV-365 
model, IV-362 
moving-averages, IV-360 
Nadaraya-Watson, IV-360 
nonparametric vs. parametric, IV-350 


overview, IV-349 

polynomial smoothing, IV-358, IV-365 

Quick Graphs, IV-366 

resampling, IV-349 

residuals, IV-362, IV-366 

robust smoothing, IV-358, IV-365 

running-means, IV-360 

saving results, IV-364, IV-366, IV-367 

Shepard’s smoother, IV-360 

step, IV-361 

tied values, IV-361 

tricube kernel, IV-364, IV-365 

trimmed mean smoothing, IV-365 

triweight kernel, IV-364, IV-365 

uniform kernel, IV-364 

usage, IV-366 

window normalization, IV-357, IV-364 
Sneath and Sokal's binary similarity coefficient, I-164
Somers’ d coefficients, I-227, I-235
Sorting, I-5
spaghetti plot, II-458
spatial statistics, IV-385
algorithms, IV-426
azimuth, IV-403
commands, IV-408
data, IV-410
dip, IV-403
examples, IV-411, IV-417, IV-418, IV-424
grid, IV-407
kriging, IV-393, IV-400, IV-405
lags, IV-402
missing data, IV-426
model, IV-385, IV-403
nested models, IV-392
nesting structures, IV-403
nugget, IV-392
nugget effect, IV-392, IV-405
plots, IV-401
point statistics, IV-400
Quick Graphs, IV-410
resampling, IV-385
sill, IV-405
simulation, IV-394, IV-401
spherical model, IV-404
trends, IV-406
usage, IV-410
variogram, IV-400
Spearman coefficients, I-162, I-172, I-227
Spearman-Brown coefficient, IV-489 
specificities, I-458 
spectral models, IV-510 
spherical model, IV-389 
split plot designs, II-175
split-half reliabilities, IV-492
SSCP matrix, III-135
standard deviation, I-3, I-301, I-307
standard error of estimate, II-7 
standard error of skewness, I-307
standard error of the mean, I-11, I-307 
standardization, I-67 
standardized alpha, IV-489
standardized deviates, I-202
standardized values, I-6
stationarity, IV-387, IV-520
statistics
defined, I-1
descriptive, I-1
inferential, I-7
stem-and-leaf plots, I-3, I-299
step smoother, IV-361 
stepwise regression, II-15, II-30, III-9
stochastic processes, IV-386 
stress, III-188, III-211
structural equation models 

see path analysis 
Stuart's tau-c coefficients, I-227, I-234
Student, II-197
studentized residuals, I-10
Student-Newman-Keuls test, II-27, II-119
subpopulations, I-305
subsampling, I-18 
sum of cross-products matrix, I-171 
sums of squares
type I, II-29, II-34, II-113
type II, II-35, II-113
type III, II-30, II-36, II-113
type IV, II-36
surface plot, IV-243
surface plots, IV-401
survival analysis 


AIC and Schwarz’s BIC, IV-427 

algorithms, IV-476

censoring, IV-428, IV-435, IV-479

centering, IV-477

coding variables, IV-437

commands, IV-447

convergence, IV-481

Cox regression, IV-441

data format, IV-448

estimation, IV-442 

examples, IV-449, IV-453, IV-455, IV-459, 
IV-462, IV-464, IV-468, IV-472 

exponential model, IV-441 

graphs, IV-437, IV-444 

logistic model, IV-441 

log-likelihood, IV-477 

lognormal model, IV-435, IV-477 

missing data, IV-476 

model, IV-435 

models, IV-479 

Nelson-Aalen cumulative hazard estimator, IV-438

overview, IV-427 

parameters, IV-476 

plots, IV-481 

proportional hazards models, IV-479 

Quick Graphs, IV-448 

Singular Hessian, IV-478 

stepwise, IV-482 

stepwise estimation, IV-443 

tables, IV-437, IV-444 

time dependent covariates, IV-446 

usage, IV-448 

variances, IV-483 

weibull model, IV-472 


symmetric matrix, I-160
t tests
Taguchi designs, I-353, I-377
Tamhane’s T2 test, II-27, II-119, II-197
Tanimoto dichotomy coefficients, I-164, I-173
tau-b coefficients, I-234
tau-c coefficients, I-234
test for normality, I-302
Anderson-Darling test, I-303
Shapiro-Wilk test, I-302
test item analysis 
algorithms, IV-506 
classical analysis, IV-488, IV-489, IV-491, 
IV-506 
commands, IV-494 
data format, IV-495 
examples, IV-498, IV-500, IV-503 
logistic item-response analysis, IV-490,
IV-493, IV-506
missing data, IV-507 
overview, IV-487 
Quick Graphs, IV-497 
reliabilities, IV-492 
resampling, IV-487 
scoring items, IV-492, IV-493 
statistics, IV-495 
usage, IV-495 
tests for correlation, I-535
equality of two correlations, I-522, I-537
specific correlation, I-522, I-536
zero correlation, I-522, I-535
tests for mean, I-523
one-sample t, I-520, I-526
one-sample z, I-520, I-523
paired t, I-521, I-527
poisson, I-520, I-530
two-sample t, I-521, I-528
two-sample z, I-520, I-524
tests for normality 
AD test, III-334 
K-S test, III-331 
Lilliefors test, III-334 


Shapiro-Wilk's test, I-497 
tests for proportion, I-538
equality of proportions, I-521
equality of two proportions, I-540
single proportion, I-520, I-538
tests for variance, I-531
Bartlett's test, I-521
equality of several variances, I-534
equality of two variances, I-521, I-532
Levene's test, I-521
single variance, I-531
tetrachoric correlation, I-164, I-166
theory of signal detectability (TSD), IV-319 
time domain models, IV-510 
time series, IV-509 
algorithms, IV-578 
ARIMA models, IV-514, IV-540 
Clear series, IV-534 
commands, IV-532, IV-534, IV-539, IV-540, 
IV-542, IV-544, IV-546 
data format, IV-546 
examples, IV-547, IV-548, IV-549, IV-550,
IV-552, IV-555, IV-557, IV-558, IV-560,
IV-561, IV-566, IV-575
forecasts, IV-538 
Fourier transformations, IV-545 
missing values, IV-509 
moving average, IV-511, IV-535 
overview, IV-509 
plot labels, IV-528 
plots, IV-528, IV-529, IV-530, IV-531 
Quick Graphs, IV-546 
running means, IV-512, IV-535 
running medians, IV-512, IV-536 
seasonal adjustments, IV-523, IV-539
smoothing, IV-510, IV-535, IV-536, IV-537
stationarity, IV-520
transformations, IV-532, IV-534
trend analysis, IV-525, IV-542
trends, IV-538 
usage, IV-546 
tolerance, II-16
T-plots, IV-529
trace criterion
see A-optimality
tree clustering methods, I-47
tree diagrams, I-70
trend analysis, IV-525, IV-542
Homogeneity test, IV-544 
Mann-Kendall test, IV-526, IV-543 
Modified Seasonal Kendall test, IV-543 
Seasonal Kendall test, IV-526, IV-543 
slope estimator, IV-573 
triangle inequality, III-186
tricube kernel, IV-364
trimmed mean, I-299, I-308
trimmed mean smoothing, IV-365 
triweight kernel, IV-364 
t-tests, IV-19 
one-sample, I-526, IV-50
paired, I-527, IV-51
power analysis, IV-26
two-sample, I-528, IV-53
Tukey procedure, III-279
Tukey test, II-27, II-118, II-196
Tukey’s b test, II-27, II-119, II-197
Tukey’s HSD test, II-307, II-395
Tukey’s jackknife, I-18
twoing, I-48
two-stage least squares 
algorithms, IV-597
commands, IV-586
estimation, IV-582
examples, IV-587, IV-590, IV-592, IV-593,
IV-595, IV-596
heteroskedasticity-consistent standard errors,
IV-586

lagged variables, IV-586 
missing data, IV-597
model, IV-585 
overview, IV-581 
Quick Graphs, IV-586 
usage, IV-586 

Type I error, IV-21 

Type II error, IV-22
u charts, IV-133, IV-134
unbalanced designs
in analysis of variance, II-29


uncertainty coefficient, I-234 
unfolding models, IV-3 
uniform kernel, IV-364 


V


validity, I-87
variance, I-307
of estimates, I-355
variance charts, IV-124
variance component models 


see mixed regression 


variance components 


categorical variables, II-303
commands, II-310
examples, II-311, II-315, II-320, II-323, II-326,
II-328, II-334, II-340
hypothesis test, II-306
model estimation, II-301
models, II-301
options, II-304
overview, II-299
Quick Graph, II-310
usage, II-310


variance inflation factor, II-70
variance of prediction, I-356
variance paths
path analysis, III-401
varimax rotation, I-460, I-464
variograms, IV-388, IV-401 


model, IV-389 


vector model 


in perceptual mapping, IV-5 


Voronoi polygons, IV-385, IV-397, IV-400
Wald-Wolfowitz runs test, III-337
wave model, IV-391




Weibull, III-334
Weibull distribution, IV-432
weighted running smoothing, IV-512
weights, I-23, I-54, I-135, I-179, I-206, I-246,
I-248, I-323, I-371, I-408, I-469, I-503, I-544, II-54,
II-121, II-122, II-202, II-311, II-357, II-399,
II-441, II-442, III-23, III-103, III-104, III-137,
III-194, III-217, III-283, III-339, III-340, III-364,
III-385, III-413, IV-9, IV-63, IV-104, IV-162,
IV-244, IV-280, IV-305, IV-325, IV-328, IV-366,
IV-367, IV-410, IV-449, IV-495, IV-498, IV-547,
IV-587
Wilcoxon Signed-Rank test, III-326
Wilcoxon test, III-326
Wilk's trace, I-405
Wilks’ lambda, I-405, III-225
Winter's three-parameter model, IV-524
Within-Group Testing, III-241, III-257
within-subjects differences
in analysis of variance, II-32


X


X charts, IV-129
X-bar charts, IV-123
plotting with R charts, IV-129
plotting with s charts, IV-129
X-MR charts, IV-149
control limits, IV-149


Y


Yates’ correction, I-226, I-233
y-intercept, II-12
Young's S-STRESS, III-190
Yule’s Q, I-228
Yule’s Q coefficient, I-164
Yule's Y, I-228, I-234


Z


z tests
z-tests, IV-19
one-sample, IV-46
two-sample, IV-48